This article will try to explain what is junk dimension (also known as Mystery Dimension per Ralph Kimball new books) concept in the realm of dimensional modelling and why it is crucial to have at least one such junk dimension in every BI model.
-Concept of Junk Dimension
Junk Dimension simply put is the is an artificial combination of several unrelated objects together into a single table to help reduce design clutter. The several unrelated objects need to be of low cardinality where it does not make sense to make them into the separate dimensions. The common example for such candidates in any model is typically demographics that combines gender, race, year ranges, etc. Note how you can label this junk dimension as a single concept “demographics”.
This can be further extended to other low cardinality attributes in the data model where total number of dimensional rows could be less than 100 or 1000.
Example of Junk Dimension- Imagine a typical requirement where an organization is split into various parallel hierarchy Views based on various different departments like Funding Group , Functional Group , Legal Groups etc. Handling such different parallel views of a “single physical entity” is also a common issue in Dimensional data modelling. All these parallel views can be included in a single “Junk /Mystery” dimension and assigned a single Junk key. This is how it can look at logical data model layer.
-How to Populate the Junk Dimension
A junk dimension can be populated as cross join of all the entities . There is also an alternate way to assign role to each entity and populate the junk dimension as a “exclusive set” of drows of different entities. This can be done as follows.
Create dimensions for the 3/4 hierarchies. Add in a Null row in each hierarchy like you would for any table. Assign the dimension keys like normal in a staging fact table. from that staging table, select distinct combinations of the surrogate keys from 4 hierarchies. This is the base of the junk dimension. De-normalize the columns from the 4 hierarchy dimensions into the junk dimension (the junk has a dimension key). When you load the fact table, replace the 4 dimension keys from the hierarchies with the dimension key from the junk dimension.
A simple data view will look like this for above example
Use of Junk Dimension
The way junk dimension is used in the reporting layer typically congos is by splitting these diff small dimensions back into the different query subjects so for the end user the model appears with different objects. It is only at the data warehouse layer that the object are put together.
If we put together different entities and then split them apart why are we doing it ?
-Advantages of Junk Dimension
In a nutshell ,what does I had in mind when modelling Organization (say any “Business Entity” ) as junk ?
- Improves data warehousing performance for the low cardinality objects
- It does not affect end user perspectives whatsoever, he sees them as diff entities as “dimensional perspectives” as defined in Reporting layer will be implemented in a BI tool like cognos
- De-clutter design
- Track history /SCD2 in simplified & easier manner
- Can be scalable to all future requirements which can come at diff roles & at diff grains , which can be easily handled in a single data object.
- The another benefit normally derived from any junk & intended here is to make maintenance easier (not difficult !!). It is normally easy to maintain 1 SK (Surrogate Keys) instead of 3 , 1 SCD (slowly changing dimension) tracking instead of 3 (the joins will not be more complex for SCD as only the queries pertaining to role flag will be queried which is same as querying it as standalone entity , however etl jobs for these scd would be easier doing in once instead in 3 times ), and to maintain a single data object in database instead of 3.(More entities can be combined over three which can be even easier)
- Avoid making this model as “Centipede dimensional model” and is one of key reasons why we use junk dimension anywhere.
Lets look into how each of these advantages works out in practice in more finer details.
1. Performance Improvement
We will discuss this in more details in the next article as this has a larger scope.
2. Transparent to the User
This is the benefit which must be stressed since many times there is “political” opposition to creation of Junk dimension as people say those entities are not together. The explanation that must go to them that Junk dimension DOES NOT exist in end user layer or reporting layer at all and is completely transparent to the end user.
3. De-clutter design
Keeps the design simplified , one the basic principal of Ralph Kimball school of modelling methodology is to avoid the design clutter & junk dimension is thus a recommended approach.
4. Track history /SCD2 in simplified & easier manner
The method deployed by the ETL to populate & update the slowly changing dimensions (SCD) is pretty complex . In such a case a junk dimension which creates less number of SCD scenarios in a model is bound to make ETL coding & maintenance easier & better.
5. Scalable to All Future Requirements
This is the most important benefit of junk dimension and aprimary reason why every dimensional model should have one . We can look into it in more details.
5.1. Re-use in other data marts
Take above example .Suppose we are loading salary schema which does that have a legal group but only Functional & funding group , the same physical dim & same SK can be used there . Also say we want to model for hiring schema where people have funding grp & legal grp but no functional grp , then same two Org SK can be utilized.
5.2. Future Enhancement
Another thing is this junk can be extended to any new possibilities by just adding to “new key values (new rows) of existing SK ” and NOT introducing new SK to system. Adding a new SK entity to a live data warehouse system is normally impossible as its not possible to manage such a change Thus , In any model , one thing is always kept as junk dimension as mechanism to absorb any extensions & allow model to gracefully extend without altering key structure.
Instead of this if we add a new table to new future requirements then “Simply adding another table ” to the database to handle future req , in existing dimensional model will break the key integrity (& will add new keys) and such change is not allowed. (the only possible alternative to that will be to design a new data mart or star schema altogether again)
Adding a new model to existing dim will not be supported as it will add a new key to the system /fact table which would not be able to rationalize whereas all the prior data will be loaded wrt earlier key rationalisation process and two will not merge with each other. In case of junk dimension even though it looks like “three diff keys” , its actually one key . And same key can be reused for other possible extensions. Ignore the lables in structure and look at keys purely as ref objects.
Now some would ask—- If we add new data to the junk dimension we would need to add another key to the fact table as well.? It would be yet another role-playing key to the existing dimension but it would still be a new key and it seems to me this would be just as troublesome or concerning to the data model as adding a brand new dimension table with its own key. The only way a junk dimension would avoid this would be to have it contain all possible “future” combinations of the data attribute?How does it help then if an unforeseen change comes?
To this , Let me try & explain it other way round . It is also possible to meet current req by having “only one” physical key as org sk to fact table. This key will take into account all possible cross joins combinations into account for a given junk entities at a given point in time . ( This is the commonly followed approach in most of above technical books ) .
Now assume a scenario where a new entity is to be introduced. Now you re-caliberate junk dim with new entity values, the new junk will then be rationalized for existing SK values so as to handle both old & new entity values . This way same key is now addressing more information in itself than what it was previously doing by referring to new entities internally in junk . This does not alter fact table design. Its Junk SK now contains more information . The only that changes is data population of SK internally & still extends to new entities.
What will change is our perception & information of what a given “Junk Key” means – in physical world , same key will continue to exits without any disruptions. So “Functional grp key” /Funding Grp Key/Legal Grp key these are just labels for a “Junk Key” which in reality remains a entity.
5.3. Avoid Adding future surrogate values/
Let us address this point with the question –
Consider an example where we add a new dimension for hair color (we would never do this, but it is a simple example). We want to be able to have hierarchies, so this is a new dimension and not a new attribute on the Employee.
Without a junk dimension, this means adding a new dimension table (in the physical, logical, and end-user presentation layers). It also means updating the fact table to include a new key to the hair color dimension. To avoid outer joins, we should avoid optional keys from the fact to a dimension so this means we need to assign a value to every record – current and historical. If we further assume we don’t have any history data on this dimension then we’re stuck with two choices… make it an optional key and accept the performance impact, or create a ‘Unknown’ key on the dimension and map all the history records to that.
If we instead have a junk dimension, this means altering the dimension table (in the physical and logical layers), creating a new dimension in the end-user presentation layer, and altering the fact key label (in the logical and end-user layers). The ETL needs to be updated to create the cross-joins between the previous junk dimension and the new hair-color dimension. The problem still remains of how to assign the key to fact records. In this case we cannot make the join optional since there is other data for that join, so the only option is to create a ‘Unknown’ value in the dimension and map the history facts to the cross-join records with this Unknown key.
The response to above query is of how to deal with “Junk Dimension Maintenance”
We must note that in a dimesnional modelling world, A Null key in a fact is not allowed as default joins , cube etc. is build on that and all reporting tools expect a fact keys . So even if say we get a hair colour dimension request today & we create a new table for it and a new SK , we then alter DDL for fact , add a new key , with unknown till that date. Now assume a case , we will start our warehouse with nearly 6-10 million records in fact table (average size) , it would mean you go and UPDATE these millions records of fact table with that unknown key of new hair dimension. Normally as you would be aware , a fact table is NEVER allowed to update for obvious reasons (it makes even more impractical if data is partitioned) . Its insert only table by definition.
Now, Consider a junk dim , you contextualize your old keys to that unknown entity (of hair dim) and whenever new data is populated it will include new key values. This way we handle a change without any physical change to system , without any change to keys and without being a need to update a huge fact table.
6. Less Surrogate Keys
The method deployed by the ETL to populate & update the surrogate keys is pretty complex . In such a case a junk dimension which creates less number of surrogate keys is bound to make ETL coding & maintenance easier & better.
7. Avoid a Centipede Dimensional Model
Any dimensional model for a data mart which has more than 10/15 dimensions is considered a bad design & shows that architect has not done his job properly. One of the ways to avoid exceeding large number dimensions is to have a junk dimension.
From this discussion, we are coming to this understanding of junk dimensions:
- Avoids updates to the fact table (both the DDL and the existing records) when a new dimension is added
- Reduced number of dimensions/less cluttered diagram
- Handles a parallel views of a single entity in a manner which scalable & gracefully extendible
- Makes maintenance easier
- Improves data warehouse performance
We will explain in next article how a junk dimension helps to improve the performance of the data warehousing queries.