Modelling Tip- Why it makes sense to ALWAYS have a surrogate key for Date Dim?

While there is normally a consensus about using a surrogate key in all the dimensions in a dimensional model , there are some who wonder if we should be using a surrogate key for a date dimension. Date dimension is used in every single data model so this question is quite typical. Here are some thoughts about why it makes sense to always use a surrogate key.

-Points against using surrogate key for Date Dimension

Many want to use a “Date” itself as the key for the date dimension and do not want to add a surrogate key.

Their concerns are
– A few purposes of a SK (Surrogate Key) would be to make the joins efficient, avoid changing meaning over time, or when there are multiple ways to view the same data. In this case, however, it has a DATE type key which is equally efficient for the joins, and is not going to change meanings over time. All our reporting is based on the Gregorian calendar system so we don’t need to handle alternate calendars or views of dates.

-A SK makes sense when the meaning of the natural value may change over time (ie – the name of a department may change), or when the natural value is difficult to join on (ie – String). In this case, a date will never change meaning (May 31 will always be May 31) and the natural value is just as efficient to join as an integer SK. In this case, there is no benefit to adding an SK to the model or the dimension.

-Why we ALWAYS need a surrogate key for a date dimension?

Date dimension does not undergo any SCD 1/2/3 so one of the reason for using a surrogate key for date dimension is not to maintain history. The reason for using it are driven more by the performance benefits and a better design in terms of data warehouse life cycle maintenance.

Let us look at the sample date dimension .

Date Dimension

Hierarchy Grouping – Two drills possible across Fiscal & Calendar Hierarchy as shown below.

Date Dimension Attribute Descriptions

Attribute Name
Attribute Description
Sample Values
Date Dim SK
Date Surrogate Key.
Sequence
Month Key
Key to identify each month uniquely
Year Key
Key to identify each year uniquely
Day
The specific day that an activity took place.
06/04/1998; 06/05/1998
Day of Week
The specific name of the day.
Monday; Tuesday
Holiday
Identifies that this day is a holiday.
Easter; Thanksgiving
Type of Day
Indicates whether or not this day is a weekday or a weekend day.
Weekend; Weekday
Calendar Week
The week ending date, always a Saturday.  Note that WE denotes week ending.
WE 06/06/1998;
WE 06/13/1998
Calendar Month
The calendar month.
January,1998; February, 1998
Calendar Quarter
The calendar quarter.
1998Q1; 1998Q4
Calendar Year
The calendar year.
1998
Fiscal Week
The week that represents the corporate calendar. Note that the F in the data value indicates that this is a fiscal time period.
F Week 1 1998;
F Week 46 1998
Fiscal Month
The fiscal period comprised of 4 or 5 weeks. Note that the F in the data value indicates that this is a fiscal time period.
F January, 1998;
F February, 1998
Fiscal Quarter
The grouping of 3 fiscal months.
F 1998Q1; F1998Q2
Fiscal Year
The grouping of 52 fiscal weeks / 12 fiscal months that comprise the financial year.
F 1998; F 1999
Calendar Month No.
1,2 ,3
Fiscal Month No.
1,2,3

Advantages for using a date dimension are as below

1. Can be scalable to different calender views

One of the benefit of using a surrogate key in a date dimension is that it can be scalable to different grains of use of date model and to different calender views. For example some reports may want to use a Gregorian  calender while other may want to use a Fiscal calender. There can also be a custom view of calender required in some companies . To accommodate such requirements and support different views a surrogate key is a better option.

2. Can be scalable to different grains

A grain in realm of dimensional model is defined as the lowest level at which data is stored. Grain can change in every data mart depending upon how the data is available for that particular business process. In case of date depending of data it can be at times at day level , week level or month level. For example in a typical HR system , the data related to Hiring and attrition is loaded daily so grain for this Day , while data related to labor time-sheets is loaded weekly so grain for that data mart will be week and for salary data it is monthly .

Now if want to load data at month grain into a data mart re-using a date dimension which is at a day level , then a surrogate key is a good option.

For example if our grain is month, we need the dim should behave as a Month dimension.
It can behave as a month dimension even while having date-level records. Each record has the date, month, quarter, and year-level attributes defined so from a single date we can determine all the month-level attributes we would need for display.

3. Better performance for Date related operations

One of the main reason for using a integer surrogate key in a date dimension is to improve the performance of date related functions . A typical requirements in data warehousing is the finding date intervals. If we use actual dates as key then we would need to use a “Date Interval functions” typically “Effective Date ” or “date between” function for such operations. Considering millions of records   such operation literally sucks at reporting layer.

A better option is use a sequential integer based surrogate key for dates so that such operations becomes simply an arithmetic operation which can be carried out very fast. For example the surrogate key for 1Jan 2012 is 1000 then surrogate for 30 Jan 2012 will be 1030. So to find this period instead of using date function simply do a “Surrogate key2-Surrogate key1” i.e. 1030-1000=30 days. This option gives much better performance as it is not using any database functionality & avoid UN-neccesary loading of DB queries.


4. Better performance at Joins

Using a “date” to join the dim is never idea compared to having a SK join. A surrogate key which is integer based will be at max a 2 byte entity while a date is 6-8 byte entity depending on various database vendors. If you use a lower size key it not just saves memory but gives a much better performance at joins. A join of 2 byte entity (SK in our case ) vs a join of date entity which 6-8 byte will be tremendously better and can be deduced from the common sense.

5. Better I/O & Data-warehouse Optimization

Both above points also adds to lower I/O operations for the server since it makes queries less memory intensive and faster. This means lower CPU utilization for server and better I/O operations.

It must be kept in mind that Date dimension is not just another dimension in a dimensional model but the most important and often most frequently used dimensional in BI reports . It does make sense to begin the data warehouse optimization from this & whatever steps taken in optimization of date dimension goes a long way in overall  impact.

Advertisements

2 thoughts on “Modelling Tip- Why it makes sense to ALWAYS have a surrogate key for Date Dim?

  1. Virtually everything in this post is incorrect. Let’s go point by point:

    1. Can be scalable to different calender views

    A ‘natural’ date can be converted to a fiscal date with a one time offset, which is done in front of the query, not in the query. If what the author said was true, we would be using surrogate keys for *all* date columns, and we are not.

    2. Can be scalable to different grains

    This can easily be handled at the application level or with simple constraints on the date type. If I need to intermix different time series (say, weekly with quarterly data), I’ll use a non-surrogate string, say ‘Y2012M01’ in order to exploit composite keys for joinless time searching (read: orders of magnitude faster).

    3. Better performance for Date related operations

    This is completely false. All database systems I know of express dates internally as integers and so date adjustment is really just simple integer operations.

    4. Better performance at Joins

    This is the worst argument. If you inline the date into the table, the join can be completely skipped. How can you get better performance than nothing? Also, inlined dates open the door for composite indexing for ‘time range + X’ queries, a common use case, which is impossible with surrogate keys.

    5. Better I/O & Data-warehouse Optimization

    Date types are typically 3-4 bytes and expressed as integers. 2 byte types are also frequently padded to four bytes for alignment padding anyways. So it’s a wash, except that now you need to expand to a join whenever a time based search is needed.

  2. Some good points but here could be clarifications.

    1- The purpose of surrogate key or how it can be used is not understood properly. The clients can have various customized calenders as well apart from fiscal & gregarion calenders . In fact the definition of fiscal calenders various from country to country and in todays global clients multiple views need to be generated on the fly. A “one time offset” is not a solution in such cases as it would be gracefully extendable.

    2.Handling the processing at application level as a thumb-rule always costlier than doing it in backend. The padding also makes it a string which will be heavier to process than a integer variable.

    3. Database express dates in different format based upon their compression algorithm. It can vary from 4 bytes to 6 to 8 ; however none that I know of can do in 2 bytes and this is where it is advantageous.

    4.The joins are inevitable in a star schema design of data warehousing and so the comparison was made in that context. A “star query” is how its typically formed in BI apps and that is where faster joins are necessity.

    5. You can refer to another article on same here. The performance benefits are well documented by industry over a period of time.

    http://www.ewsolutions.com/resource-center/rwds_folder/rwds-archives/issue.2007-04-04.5238588497/document.2007-04-04.9201893591

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s