In the preceding article we saw the concept of a Junk Dimension, how it is developed, and how it helps in the life-cycle of dimensional modelling.
Some people, however, are concerned about the performance of a Junk Dimension: since we are putting multiple entities together, the worry is that we will be worse off in terms of query performance for having created a bigger dimension.
In reality, nothing could be farther from the truth! One of the biggest performance nightmares in data warehousing systems is low-cardinality joins, and a Junk Dimension actually helps us in these cases by improving performance, not degrading it.
-Low-Cardinality Joins Concept
A low-cardinality dimension is normally one where either the absolute number of rows is very low (a few hundred or fewer), or where the cardinality of the reporting column (normally a hierarchy level) on which most reporting is done is very low. For example, in a geography dimension the most frequently used level is country, and for most companies there are fewer than 10 country values.
When we design something as a Junk Dimension we put such low-cardinality values together. For example, below are the data-profiling details of the Junk Dimension we created in the previous article.
For Funding Group 1/2 the cardinality is around 10-20 distinct values (around 2%); for Functional Group it is around 1% on Group codes; and for Legal Group the cardinality ranges from 1-2 distinct values for Blue Type (0.01%) to around 100 distinct values (less than 0.4%) for Company Country. By all definitions these are very low-cardinality entities, and so they are modelled as a Junk.
These cardinalities are so low in absolute terms that they will not be properly supported by a normal B-tree index. (We could use a bitmap index, but it is not supported by all vendors, so a modelling avenue needs to be sought.)
Now, if these entities were modelled as separate physical entities, they would run into the low-cardinality joins issue. A low-cardinality dimension all by itself lowers performance, since the database will normally treat these joins as NL (Nested Loop) joins, and 3-4 such NL joins in succession make matters worse. There are other ways to tackle this: sometimes a bitmap index helps, since it is a bit-vector indexing technology that works well for low-cardinality fields (creating a B-tree index on a low-cardinality field is worse than having no index at all), but not all vendors support them; DB2, in particular, still does not. So grouping the low-cardinality entities together normally makes a lot of sense.
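To make this concrete, here is a minimal sketch (in Python, with illustrative attribute names, not the actual profiled columns from the article) of how a junk dimension is built by cross-joining a handful of low-cardinality attributes and assigning a surrogate key:

```python
from itertools import product

# Hypothetical low-cardinality attributes (names are illustrative only).
funding_groups = ["FG-A", "FG-B", "FG-C"]
legal_groups = ["Legal-1", "Legal-2"]
blue_types = ["Y", "N"]

# Cross-join every combination and assign a surrogate key.
junk_dimension = [
    {"junk_key": i, "funding_group": f, "legal_group": l, "blue_type": b}
    for i, (f, l, b) in enumerate(
        product(funding_groups, legal_groups, blue_types), start=1
    )
]

# Even a full cross-join of low-cardinality attributes stays tiny:
# 3 x 2 x 2 = 12 rows, versus three separate dimension tables
# (and three separate joins against the fact table).
print(len(junk_dimension))  # 12
```

The fact table then carries a single surrogate key instead of three, which is what eliminates the extra low-cardinality joins later.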
Our fact table will have 30-50 million rows to start with and will grow to 50-80 million rows. Against a table of that size, NL (Nested Loop) joins with very low-cardinality entities (10-100 rows) should be avoided, or the optimizer should be directed towards SM (Sort Merge) or hash joins instead.
-How to Avoid Nested Loop Joins for Low-Cardinality Data Warehousing Queries
For query performance, a pure junk dimension where all the data elements are cross-joined reduces the number of physical DB joins needed to execute a query. If a query wants data from only one of the dimensions, performance is neither better nor worse, since we have a single join for a single dimension. If a query wants data from more than one of the cross-joined dimensions, a pure junk dimension improves performance slightly by reducing the number of physical DB joins required.
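As an illustration (using SQLite and made-up table and column names, not the article's actual model), a single join to the junk dimension returns attributes that would otherwise require two separate dimension joins:

```python
import sqlite3

# A fact table carrying one junk-dimension key instead of two
# separate low-cardinality dimension keys.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE junk_dim (
    junk_key      INTEGER PRIMARY KEY,
    funding_group TEXT,
    legal_group   TEXT
);
CREATE TABLE fact_sales (
    junk_key INTEGER REFERENCES junk_dim(junk_key),
    amount   REAL
);
INSERT INTO junk_dim VALUES (1, 'FG-A', 'Legal-1'), (2, 'FG-A', 'Legal-2'),
                            (3, 'FG-B', 'Legal-1'), (4, 'FG-B', 'Legal-2');
INSERT INTO fact_sales VALUES (1, 100.0), (2, 50.0), (4, 75.0), (1, 25.0);
""")

# One physical join returns values from what would otherwise be two
# separately joined low-cardinality dimensions.
cur.execute("""
    SELECT j.funding_group, j.legal_group, SUM(f.amount)
    FROM fact_sales f
    JOIN junk_dim j ON j.junk_key = f.junk_key
    GROUP BY 1, 2
    ORDER BY 1, 2
""")
for row in cur.fetchall():
    print(row)
```

The same report against two separate dimensions would need two joins in the plan; here the optimizer sees only one.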
Regarding the type of join used, a database performance guide will say that the specific join type (nested loop – NL, sort merge – SM, or hybrid) is determined by the optimizer at run time based on the sizes of the respective tables, the predicates and join columns provided, and other factors. The DB might also dynamically reorder the tables when it performs the joins, to ensure it selects the most efficient access path. It might therefore seem that we should not be concerned with the join type at all, because the DB system will handle it and select the most efficient path.
While this is true in principle (the DB optimizer will do it), it is not so "intelligent" all the time. A Cartesian product join, which in practice the engine executes much like a hash join, normally performs best (especially when it uses a prefetch algorithm), and we should try to see it used more often; BUT the optimizer normally uses it only if a "master-slave" data-size threshold is satisfied. I have worked on pushing the DB optimizer to use it over the other two join types, as most of the time it wasn't smart enough 🙂
In cases where one table is very small and the other very large, the optimizer normally builds a nested loop, as we can see from explain plans. The problem with nested loop joins is compounded with "low-cardinality" data (the DW case), where B-tree indexes do not help. In these cases we can drive the optimizer towards a Sort Merge join, where the B-tree index sort order helps it, or ideally towards hash joins.
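A toy sketch of the difference (plain Python, not any vendor's actual implementation): a nested loop compares every fact row against every dimension row, while a hash join builds a hash table on the small side once and then probes it once per fact row.

```python
def nested_loop_join(facts, dim):
    # Touches every (fact, dim) pair: |facts| * |dim| comparisons.
    out = []
    for f in facts:
        for d in dim:
            if f["key"] == d["key"]:
                out.append((f, d))
    return out

def hash_join(facts, dim):
    # Build phase on the small side: O(|dim|).
    build = {d["key"]: d for d in dim}
    # Probe phase: one hash lookup per fact row, O(|facts|).
    return [(f, build[f["key"]]) for f in facts if f["key"] in build]

dim = [{"key": k, "val": f"grp-{k}"} for k in range(10)]   # low cardinality
facts = [{"key": k % 10, "amt": k} for k in range(1000)]   # "fact" rows

# Same result either way; the cost profile is what differs:
# nested loop ~10,000 comparisons vs 10 inserts + 1,000 probes.
assert nested_loop_join(facts, dim) == hash_join(facts, dim)
```

The gap widens exactly in the DW situation the article describes: the fact side is tens of millions of rows and a B-tree index on the 10-row side buys nothing.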
One of the possible choices the DB optimizer could make is a Cartesian product join, effectively building a cross-joined pure junk dimension internally, on the fly. For those queries where it makes sense the data would be cross-joined, while for other queries the dimensions would be accessed individually, improving performance.
A key point in the usefulness of this approach is how often the cross-joined dimensions will be needed in the same query. The performance benefit is greatest when a single join retrieves multiple dimensions. In this model the dimensions each show a different view of the "organization", so if we will often be querying, say, how many people are in one legal group and a different funding group, then the benefits are multi-fold.
The cases where I have seen cross-joins suggested in RDBMS documentation involve multiple medium-cardinality data-sets (i.e. tens to hundreds of records) joined to a single extremely high-cardinality data-set (i.e. millions of records). Our data falls somewhere in the middle of this range, which is why a Junk Dimension helps performance.
-Impact on Report Prompts Performance
Some may wonder whether the performance of report prompts would suffer, i.e. whether listing the values for a dimension level (such as to populate a prompt list) would be slower. We care about both end-report performance and prompt-list performance, since both affect end-user perception and productivity.
Actually, prompt queries are run by the BI tool with the help of a DISTINCT query rewrite, and more often than not the values are cached. Since report prompts deploy no joins, just a DISTINCT query, an index on the column is all that is needed for report-prompt columns. A Junk Dimension makes prompt queries neither better nor worse; it has no impact whatsoever.
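A quick illustration (SQLite, hypothetical column names) of what such a prompt query looks like: a single-column DISTINCT with no joins, which a plain column index serves fine regardless of whether the column lives in a junk dimension or its own table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE junk_dim (junk_key INTEGER, funding_group TEXT)")
cur.executemany("INSERT INTO junk_dim VALUES (?, ?)",
                [(1, "FG-A"), (2, "FG-A"), (3, "FG-B"), (4, "FG-C")])
conn.commit()

# A plain index on the column is all this query needs.
cur.execute("CREATE INDEX ix_funding ON junk_dim(funding_group)")

# The kind of rewrite a BI tool issues to populate a prompt list:
# no joins, just DISTINCT over one column.
cur.execute("SELECT DISTINCT funding_group FROM junk_dim ORDER BY funding_group")
print([r[0] for r in cur.fetchall()])  # ['FG-A', 'FG-B', 'FG-C']
```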
-Correlation of Junk Dimension & Performance
The decision to have a junk dimension and the query join plans or performance are related, but they do not always drive each other. As such, the decision to include a junk is basically driven by the design-related points mentioned earlier as pros of using a junk.
The only additional pro point I wanted to add here is to avoid pure low-cardinality joins and to reduce the number of low-cardinality data-sets (or tables, if you prefer) in the system. The way relational DB optimizers normally approach it (with the exception of Teradata, which has an OLAP-specific architecture) is to build "first row out" access plans, and there they give not-so-optimal performance. There are "full set out" access plans too, but the internal sorting algorithms they use make life no easier. Hashing algorithms are the ones that work best in these cases :)
This is one way cross-joins are "suggested" to the RDBMS (by avoiding extremely low-cardinality stuff). If there are other specific "data warehouse design methods" we can deploy to influence the optimizer, do share links with me; I would be interested in knowing more.
Here is the link if anyone is interested in reading more about the various join types and when DB2 uses them: http://publib.boulder.ibm.com/infocenter/dzichelp/v2r2/topic/com.ibm.db2z10.doc.perf/src/tpc/db2z_plantablejoinexamples.htm