
2016-04-21

Measuring Dimensional Models



How many reports can be supported with a standard dimensional data model?



I have been asked this question many times, usually by a questioner incorrectly claiming that some other reporting method would give better results.
[Image: star schema]

In this write-up, I will examine some recently derived techniques for calculating precisely how many queries can be executed on a well-designed dimensional model. These techniques may or may not be limited to dimensional structures; expanding their use to third normal form or data vault models is left for a later exercise.

My original research on this topic focused on the relationships between the tables themselves. While that research is interesting in its own right and is covered in my papers on Data Structure Graphs, the method described here shows that an adjacency matrix, similar to the kind used in graph calculations, combined with the column counts of the tables involved, allows a direct calculation of the number of possible queries a dimensional model can support.


In the following calculations, the only columns we will be referring to are those columns that are not keys. Surrogate, primary, and foreign keys do not factor into the calculations.


This is because, in a dimensional model, these keys are generally used only to join tables together. These expressions assume that the keys used in the JOIN clause of a given query will not appear in the SELECT clause.
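For example, using the tables from the worked example later in this post, a typical query might take the shape below (a sketch; the *_Key column names are hypothetical). The keys appear only in the JOIN conditions, never in the SELECT list.

example_query <- "
  SELECT d.Year, g.State, SUM(f.Sale) AS Total_Sales
  FROM   Fct_Sale f
  JOIN   Dim_Date d ON f.Date_Key = d.Date_Key
  JOIN   Dim_Geo  g ON f.Geo_Key  = g.Geo_Key
  GROUP  BY d.Year, g.State
"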

The method:


The method to calculate the total number of queries that any “fact” table can support is to add up the number of columns from all tables that can be joined (the columns in the dimensions, excluding primary and foreign keys), plus the number of columns in the fact table that act as measures (assume 1 for a “factless” fact table). Once you have this number, the formula is:

2^C - 1

where C is the total number of columns that can be used in the SELECT clause for this query. The “-1” is there because, while there are many possible combinations of columns, a query never selects zero columns.

Example:


One method to facilitate doing this quickly is to use R. Select the count of all columns from the information schema for each table. Load this data into a vector in R.
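A minimal sketch of that step, assuming a DBI connection named con, a schema called sales_mart, and a naming convention where key columns end in "_key" (all three are assumptions to adapt to your environment):

library(DBI)

# Count the non-key columns per table in the mart
col_counts <- dbGetQuery(con, "
  SELECT table_name, COUNT(*) AS column_count
  FROM   information_schema.columns
  WHERE  table_schema = 'sales_mart'
    AND  column_name NOT LIKE '%_key'
  GROUP  BY table_name
")

# Named vector of column counts, ready to use in the calculation below
counts <- setNames(col_counts$column_count, col_counts$table_name)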

Create an adjacency matrix such as the following: 


             Dim_Date   Dim_Geo   Fct_Sale   Dim_Product
Dim_Date         1         0          1           0
Dim_Geo          0         1          1           0
Fct_Sale         1         1          1           0
Dim_Product      0         0          0           1

In the example above, Fct_Sale joins with Dim_Date and Dim_Geo; Dim_Product we may use later.
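The same matrix can be held directly in R. A quick sketch, using the table names above:

tables <- c("Dim_Date", "Dim_Geo", "Fct_Sale", "Dim_Product")
adj <- matrix(c(1, 0, 1, 0,
                0, 1, 1, 0,
                1, 1, 1, 0,
                0, 0, 0, 1),
              nrow = 4, byrow = TRUE,
              dimnames = list(tables, tables))
adj["Fct_Sale", ]   # Fct_Sale joins to Dim_Date, Dim_Geo, and itself: 1 1 1 0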
For illustration purposes, we will use a very small number of columns:

Table      Column1   Column2
Dim_Date   Year      Month
Dim_Geo    City      State
Fct_Sale   Sale


This would not really be used in production, but it serves our purposes. Again, no keys are shown.
So the vector that represents the number of columns in these tables is:

c(2,2,1)
The vector that represents the adjacency matrix column for this fact table, restricted to the three tables involved, is:
c(1,1,1)
So the R code to calculate the number of queries that can be supported with these tables is:

vec1 <- c(2, 2, 1)        # non-key column counts: Dim_Date, Dim_Geo, Fct_Sale
vec2 <- c(1, 1, 1)        # adjacency matrix column for Fct_Sale over those tables
2^sum(vec1 * vec2) - 1    # 2^5 - 1 = 31 possible queries
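These combinations can also be enumerated directly in R. A minimal sketch, assuming only the five example columns above:

cols <- c("Year", "Month", "City", "State", "Sale")
combos <- unlist(
  lapply(seq_along(cols),
         function(k) combn(cols, k, paste, collapse = ", ")),
  use.names = FALSE
)
length(combos)    # 31, matching 2^5 - 1
head(combos, 3)   # "Year" "Month" "City"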

A visual representation of all of these combinations is:

[Image: diagram of the 31 column combinations]
As you can see, these combinations grow rather quickly. A slightly more complicated model, with Date, Geo, and Product dimensions of 5, 7, and 13 columns plus a single measure on the fact table, gives C = 26 and therefore 2^26 - 1, or roughly 67 million, possible queries.
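Running the same calculation on that larger example (the column counts are just the ones quoted above):

vec1 <- c(5, 7, 13, 1)    # Date, Geo, Product columns, plus one measure
vec2 <- c(1, 1, 1, 1)     # the fact table joins to all three dimensions
2^sum(vec1 * vec2) - 1    # 67,108,863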

To be clear, not all combinations of selected columns will provide value; many of these millions of queries will make no sense. This method is a technique for calculating an upper bound on the possibilities.

When starting a new reporting project, should you invest the time in proper data modeling: well-documented, reusable components built with dimensional modeling techniques?

Or should you spend that time on a single report whose requirements will change later? Which approach is more robust?

As a follow-up, I encourage others to try this technique on a variety of data modeling approaches. It would be interesting to hear about an environment where the source systems, the integration layer, and the presentation layer were all measured using this technique.

There are some further use cases of this calculation that I have yet to fully explore. For example, since these numbers can become astronomically high, it may be more practical to take their log and use that value to represent the degree of complexity of a given fact table. Various fact tables could then be compared for both complexity and the amount of information they can represent: the higher the number, the more complex the table and the more information it represents. Combined with either your number of users or your number of existing reports, these figures may be useful for calculating the return on investment of a data mart project.
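One way to sketch that idea in R (the scoring function below is hypothetical, just one possible formulation):

# Base-2 log of the query count, which is roughly the usable column count C
query_complexity <- function(column_counts, adjacency_row) {
  log2(2^sum(column_counts * adjacency_row) - 1)
}
query_complexity(c(2, 2, 1), c(1, 1, 1))           # ~5 for the small example
query_complexity(c(5, 7, 13, 1), c(1, 1, 1, 1))    # ~26 for the larger one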


While I specifically examine a dimensional model here, the adjacency matrix can be built around any table that joins more than one logical concept (Customer, Order, and Product, for example).
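As a hypothetical illustration (the column counts here are invented), an Orders table joining Customer and Product could be measured the same way:

concept_cols  <- c(Customer = 6, Product = 4, Orders = 3)   # invented column counts
concept_joins <- c(Customer = 1, Product = 1, Orders = 1)   # Orders joins both concepts
2^sum(concept_cols * concept_joins) - 1                     # 2^13 - 1 = 8191 queries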

Please try this technique on your own databases, and get in touch with your results. I very much look forward to others attempting these techniques.


Good luck,

Doug



