How many reports can be supported with a standard dimensional data model?
I have been asked this question many times, usually by someone claiming, incorrectly, that some other reporting method would give better results.
In this write-up, I will examine some techniques I recently derived to calculate precisely the number of queries that can be executed on a well-designed dimensional model. These techniques may or may not be limited to dimensional structures; extending them to third normal form or data vault models is left for a later exercise.
My original research on this topic focused on the relationships between the tables themselves. While that research is interesting in its own right and is covered in my papers on Data Structure Graphs, the method described here shows that an adjacency matrix, similar to those used in graph calculations, combined with the column counts of the tables involved, allows a direct calculation of the number of possible queries a dimensional model could support.
In the following calculations, the only columns we will refer to are those that are not keys. Surrogate, primary, and foreign keys do not factor into the calculations.
This is because, in a dimensional model, these keys generally exist to join tables together. These expressions assume that the keys used in the JOIN clause of a given query will not appear in the SELECT clause.
The method:
The method for calculating the total number of queries that any "fact" table can support is to add up the number of columns from all tables that can be joined (the columns in the dimensions, excluding primary and foreign keys), plus the number of columns in the fact table that act as measures (assume 1 for a "factless" fact table). Once you have this number, the formula is:

2^C - 1

where C is the total number of columns that can be used in the SELECT clause for this query. The "-1" is there because, even though there are many possible combinations, a query never selects zero columns.
Example:
One method to facilitate doing this quickly is to use R.
Select the count of all non-key columns from the information schema for each table.
Load this data into a vector in R.
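As a sketch of those two steps, assuming a DBI connection named con to a warehouse that exposes the standard information_schema views (the schema name and the key-column naming filter below are only examples):

library(DBI)

# Count the non-key columns per table. The schema name 'dw' and the '%key'
# naming convention are assumptions; adjust the filter to match your keys.
col_counts <- dbGetQuery(con, "
  SELECT table_name, COUNT(*) AS column_count
  FROM information_schema.columns
  WHERE table_schema = 'dw'
    AND column_name NOT LIKE '%key'
  GROUP BY table_name
")

# Keep these counts in the same order as the rows of the adjacency matrix.
vec1 <- col_counts$column_count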
Create an adjacency matrix such as the following:
|             | Dim_Date | Dim_Geo | Fct_Sale | Dim_Product |
|-------------|----------|---------|----------|-------------|
| Dim_Date    | 1        | 0       | 1        | 0           |
| Dim_Geo     | 0        | 1       | 1        | 0           |
| Fct_Sale    | 1        | 1       | 1        | 0           |
| Dim_Product | 0        | 0       | 0        | 1           |
In the example above, Fct_Sale joins with Dim_Date and Dim_Geo. Dim_Product is not joined to the fact table; we may use it later.
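For completeness, the same adjacency matrix could be built in R along these lines (a sketch; 1 means the two tables join, 0 means they do not):

tables <- c("Dim_Date", "Dim_Geo", "Fct_Sale", "Dim_Product")

# 1 = the tables join, 0 = they do not (the diagonal is 1 by convention here).
adj <- matrix(c(1, 0, 1, 0,
                0, 1, 1, 0,
                1, 1, 1, 0,
                0, 0, 0, 1),
              nrow = 4, byrow = TRUE,
              dimnames = list(tables, tables))

# The Fct_Sale row, restricted to the tables whose columns we will count,
# becomes the second vector used in the calculation below.
adj["Fct_Sale", c("Dim_Date", "Dim_Geo", "Fct_Sale")]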
For illustration purposes, we will use a very small number of columns:
| Table    | Column 1 | Column 2 |
|----------|----------|----------|
| Dim_Date | Year     | Month    |
| Dim_Geo  | City     | State    |
| Fct_Sale | Sale     |          |
This would not really be used in production, but it serves
our purposes. Again, no keys are shown.
So the vector that represents the number of columns in these
tables is:
c(2,2,1)
The vector that represents the Fct_Sale row of the adjacency matrix, restricted to the same three tables, is:
c(1,1,1)
So the R code to calculate the number of queries that can be
supported with these tables is:
vec1 <- c(2, 2, 1)       # non-key column counts: Dim_Date, Dim_Geo, Fct_Sale
vec2 <- c(1, 1, 1)       # Fct_Sale row of the adjacency matrix for those tables
2^sum(vec1 * vec2) - 1   # 31 possible queries
All of these combinations can also be listed out explicitly.
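A minimal sketch with base R's combn(), listing every non-empty subset of the five toy columns (31 subsets in total, since 2^5 - 1 = 31):

cols <- c("Year", "Month", "City", "State", "Sale")

# Every non-empty combination of columns that could appear in a SELECT clause.
combos <- unlist(
  lapply(seq_along(cols),
         function(k) combn(cols, k, FUN = paste, collapse = ", ")),
  use.names = FALSE
)

length(combos)   # 31, matching 2^5 - 1
head(combos)     # "Year" "Month" "City" "State" "Sale" "Year, Month"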
As you can see, these combinations grow rather quickly. Just using Date, Geo, and Product hierarchies that are slightly more complicated, with 5, 7, and 13 non-key columns respectively and a single measure on the fact table, there are up to 67 million possible queries (2^26 - 1 = 67,108,863).
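The same arithmetic in R for that larger model (a sketch, reusing the vector approach above):

# Date, Geo, and Product dimensions with 5, 7, and 13 non-key columns,
# plus a single measure on the fact table; all three dimensions join to it.
vec1 <- c(5, 7, 13, 1)
vec2 <- c(1, 1, 1, 1)
2^sum(vec1 * vec2) - 1   # 67,108,863 possible queries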
To be clear, not all combinations of selected columns will provide value; some of these 67 million queries will make no sense. This method is a technique for calculating an upper bound on the possibilities.
Would you rather build a model that can support that many potential questions, or spend more time focused on a single report whose requirements will change later? Which technique is more robust?
As a follow-up, I encourage others to try this technique on a variety of data modeling approaches. It would be interesting to hear about an environment where the source systems, the integration layer, and the presentation layer were all measured using this technique.
There are some further use cases of this calculation that I have yet to fully explore. For example, since these numbers can become so astronomically high, it may be more practical to take the log of the query count and use that as a measure of the complexity of a given fact table. Various fact tables could then be compared for both complexity and information content: the higher the score, the more complex the table and the more information it represents. Combine these scores with your number of users, or your number of existing reports, and they may be useful for estimating the return on investment of a data mart project.
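A sketch of what that comparison might look like in R (the function names and the choice of a base-10 logarithm are my own):

# Number of possible queries for a fact table, given the non-key column
# counts of its joined dimensions and the number of measures it carries.
query_count <- function(dim_cols, measures = 1) {
  2^(sum(dim_cols) + measures) - 1
}

# Log-scaled complexity score, so very large counts stay easy to compare.
complexity_score <- function(dim_cols, measures = 1) {
  log10(query_count(dim_cols, measures))
}

complexity_score(c(2, 2))      # toy Fct_Sale model:             ~1.5
complexity_score(c(5, 7, 13))  # larger Date/Geo/Product model:  ~7.8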
While I specifically examine a dimensional model here, the central table in the adjacency matrix can be any table that joins more than one logical concept (Customer, Order, and Product, for example).
Please try this technique with your own databases, and get
in touch with your results. I very much look forward to others attempting these
techniques.
Good luck,
Doug