[Image: star schema diagram (credit: Wikipedia)]
Dimensional Modeling
When studying new concepts there is always a moment: a single moment when the concept is internalized.
We don't always remember that moment, but for me, the moment I internalized the star schema concept of dimensional modeling is easy to recall.
It was during the election of 2004.
Up until that moment, whenever I thought of a data model, it was always in some normal form.
De-normalization was not something that came naturally to me.
I had been a DBA (Database Administrator) for over a decade at that point (let's not dwell on exactly how long).
I had recently read The Data Warehouse Toolkit by Ralph Kimball and Margy Ross, and I was having a little difficulty applying this newfound knowledge.
I was watching the election returns on the night of November 2, 2004, when we had to run an errand to a car dealership about 30 minutes from home. As we drove, I played the scenario over in my head, asking myself one question:
How can they sum up so much diverse information so quickly?
Then I realized a few things:
- The data had a predefined structure.
- Most of the data was already known.
- The only thing that changed was the counts.
So I designed my first star schema while driving the I-95 corridor on election night 2004.
The Geo Dimension (Dim_Geo), predefined:
Geo_Key, State, County, City

The Candidate Dimension (Dim_Candidate):
Candidate_Key, Party, Seat, CandidateName

The Voter Dimension (Dim_Voter):
Voter_Key, VoterID, Registered_Party, Voter_Name

The Time Dimension (Dim_Time):
Time_Key, Hour, Minute

The Fact table (Fct_Vote):
Geo_Key, Candidate_Key, Voter_Key, Time_Key, 1 as vote
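A minimal DDL sketch of these five tables might look like the following (the column data types are my assumptions; only the table and column names come from the design above):

CREATE TABLE Dim_Geo (
    Geo_Key  INT PRIMARY KEY,
    State    VARCHAR(2),
    County   VARCHAR(100),
    City     VARCHAR(100)
);

CREATE TABLE Dim_Candidate (
    Candidate_Key  INT PRIMARY KEY,
    Party          VARCHAR(50),
    Seat           VARCHAR(100),
    CandidateName  VARCHAR(200)
);

CREATE TABLE Dim_Voter (
    Voter_Key         INT PRIMARY KEY,
    VoterID           VARCHAR(50),
    Registered_Party  VARCHAR(50),
    Voter_Name        VARCHAR(200)
);

CREATE TABLE Dim_Time (
    Time_Key  INT PRIMARY KEY,
    Hour      SMALLINT,
    Minute    SMALLINT
);

-- Each fact row is one vote; storing the constant 1 as vote means SUM(vote) is a count.
CREATE TABLE Fct_Vote (
    Geo_Key        INT REFERENCES Dim_Geo (Geo_Key),
    Candidate_Key  INT REFERENCES Dim_Candidate (Candidate_Key),
    Voter_Key      INT REFERENCES Dim_Voter (Voter_Key),
    Time_Key       INT REFERENCES Dim_Time (Time_Key),
    vote           SMALLINT NOT NULL DEFAULT 1
);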
Using these 5 tables, I realized I could run simple queries and get all of the same numbers that the major networks were displaying.
SELECT dg.State, dc.CandidateName, SUM(fv.vote) AS votes
FROM Fct_Vote fv
INNER JOIN Dim_Geo dg ON dg.Geo_Key = fv.Geo_Key
INNER JOIN Dim_Candidate dc ON dc.Candidate_Key = fv.Candidate_Key
GROUP BY dg.State, dc.CandidateName
This would roll up to the overall numbers.
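For example, dropping State from the grouping rolls the same fact rows up to each candidate's national total:

SELECT dc.CandidateName, SUM(fv.vote) AS votes
FROM Fct_Vote fv
INNER JOIN Dim_Candidate dc ON dc.Candidate_Key = fv.Candidate_Key
GROUP BY dc.CandidateName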
This is an incredibly simple example of what a dimensional model can do for reporting, but it shows that decoupling the structure an application needs (voter registration systems and the actual vote-counting methods are defined by each state, and could all be different) from the structure needed for reporting and analysis lets each system be efficient in its own way.
Since we also capture the time each vote takes place, we can study the growth patterns: who voted when, and whether they voted along party lines:
SELECT dc.CandidateName, dc.Party, dv.Registered_Party, dt.Hour, SUM(fv.vote) AS votes
FROM Fct_Vote fv
INNER JOIN Dim_Candidate dc ON dc.Candidate_Key = fv.Candidate_Key
INNER JOIN Dim_Voter dv ON dv.Voter_Key = fv.Voter_Key
INNER JOIN Dim_Time dt ON dt.Time_Key = fv.Time_Key
GROUP BY dc.CandidateName, dc.Party, dv.Registered_Party, dt.Hour
There is another pattern emerging here:
All the queries have the same shape, because the data model has a solid, predefined structure. Store this in a relational database with proper indexing, a sound tablespace layout, and decent memory, and this "simple" data model can answer almost any question posed to it, cache the results to speed up future questions, and serve many users.
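As one sketch of that indexing (a physical-design assumption on my part, not something dictated by the model), single-column indexes on the fact table's foreign keys support the join pattern in the queries above:

-- Indexes on the fact table's foreign keys; adjust to your own database's strengths.
CREATE INDEX ix_fct_vote_geo       ON Fct_Vote (Geo_Key);
CREATE INDEX ix_fct_vote_candidate ON Fct_Vote (Candidate_Key);
CREATE INDEX ix_fct_vote_voter     ON Fct_Vote (Voter_Key);
CREATE INDEX ix_fct_vote_time      ON Fct_Vote (Time_Key);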
This is the essence of Business Intelligence: being able to quickly answer repeatable questions with new data.
Keep these structures the same and use your ETL process to load them with current data, and report development becomes a matter of displaying the data. As the data accumulates, the same query simply shows new results.
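As a sketch of that repeatable load step, assuming a hypothetical staging table stg_votes (its name and columns are illustrative, standing in for whatever the source systems deliver):

-- Hypothetical incremental load: resolve each staged vote to its dimension keys.
INSERT INTO Fct_Vote (Geo_Key, Candidate_Key, Voter_Key, Time_Key, vote)
SELECT dg.Geo_Key, dc.Candidate_Key, dv.Voter_Key, dt.Time_Key, 1
FROM stg_votes s
INNER JOIN Dim_Geo dg ON dg.State = s.state AND dg.County = s.county AND dg.City = s.city
INNER JOIN Dim_Candidate dc ON dc.CandidateName = s.candidate_name
INNER JOIN Dim_Voter dv ON dv.VoterID = s.voter_id
INNER JOIN Dim_Time dt ON dt.Hour = s.vote_hour AND dt.Minute = s.vote_minute;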
This was one of the first steps in my journey to understanding that the structure of your data has meaning. Later on I would come to understand how to apply a Data Vault model as an integration layer before a star schema in the presentation layer.
I continue to research this topic, and I have written about my more recent application of graph theory to both data modeling and Enterprise Architecture in my book on Data Structure Graphs. I also write about Data Structure Graphs here on my blog.
This is a great method for efficiently answering questions that are repeatable: How many widgets did we sell, by region, quarter, and store manager?
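Against a hypothetical sales star (every table and column name here is illustrative, not taken from a real model), that question becomes a single query with the now-familiar shape:

-- Hypothetical sales star: Fct_Sales joined to Dim_Region, Dim_Date, and Dim_Store.
SELECT dr.Region, dd.Quarter, ds.Store_Manager, SUM(fs.units_sold) AS widgets_sold
FROM Fct_Sales fs
INNER JOIN Dim_Region dr ON dr.Region_Key = fs.Region_Key
INNER JOIN Dim_Date dd ON dd.Date_Key = fs.Date_Key
INNER JOIN Dim_Store ds ON ds.Store_Key = fs.Store_Key
GROUP BY dr.Region, dd.Quarter, ds.Store_Manager;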
For statistical mining this structure may be a good starting point, but the data has to be collapsed so that each observation is a single row carrying all of the metrics, factors, and confounders associated with that observation, as regression and other methods of non-trivial analysis require.
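A sketch of that collapsing step, using the election star above: join the fact to all of its dimensions so each row is one observation carrying every factor and potential confounder.

-- One row per vote, with every dimensional attribute flattened onto the observation.
SELECT fv.Voter_Key,
       dg.State, dg.County, dg.City,
       dc.CandidateName, dc.Party,
       dv.Registered_Party,
       dt.Hour, dt.Minute,
       fv.vote
FROM Fct_Vote fv
INNER JOIN Dim_Geo dg ON dg.Geo_Key = fv.Geo_Key
INNER JOIN Dim_Candidate dc ON dc.Candidate_Key = fv.Candidate_Key
INNER JOIN Dim_Voter dv ON dv.Voter_Key = fv.Voter_Key
INNER JOIN Dim_Time dt ON dt.Time_Key = fv.Time_Key;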
Getting as much of your reporting data into this structure, and the supporting structures mentioned above, is just one part of a Data Architecture plan.
This part is a good start, but it is only the beginning...