ETL Tip: How to Implement Surrogate Keys & Constraints in a Data Warehouse?

This article explains how to carry the surrogate keys of a logical dimensional model into a physical DBMS. There are clear-cut guidelines on this matter from Ralph Kimball & Group, yet some people still manage to make a mess of it, so it seems worth explaining in finer detail. This article takes all its concepts from Ralph Kimball's writing and only aims to explain them further for readers who may still have some doubts.

-What is a Surrogate Key?

By definition, a surrogate key is a meaningless key assigned in the data warehouse to uniquely identify the rows of a dimension. For example, consider a typical star schema of a logical dimensional model.

In such a schema all the dimensions are connected to the central fact table in a one-to-many relationship. This relationship is identified by a surrogate key, which acts as the primary key for its dimension (i.e. one surrogate key for each dimension such as Geography, Vendor, Product, etc.), and the same key is referred to as the foreign key in the fact table.
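
To make this concrete, here is a minimal DDL sketch of one such dimension and its fact table; the table and column names (DIM_PRODUCT, FACT_SALES and so on) are hypothetical:

```sql
-- Hypothetical product dimension: PRODUCT_KEY is the surrogate key,
-- PRODUCT_CODE is the natural/business key from the source system.
CREATE TABLE DIM_PRODUCT (
    PRODUCT_KEY   INTEGER      NOT NULL,  -- surrogate key, assigned by the ETL
    PRODUCT_CODE  VARCHAR(20)  NOT NULL,  -- natural key from the source
    PRODUCT_NAME  VARCHAR(100),
    CATEGORY      VARCHAR(50)
);

-- The fact table carries the surrogate key as a (logical) foreign key.
CREATE TABLE FACT_SALES (
    PRODUCT_KEY   INTEGER       NOT NULL, -- logical FK to DIM_PRODUCT
    GEOGRAPHY_KEY INTEGER       NOT NULL, -- logical FK to DIM_GEOGRAPHY
    SALES_AMOUNT  DECIMAL(15,2),
    SALES_QTY     INTEGER
);
-- Note: no PRIMARY KEY / FOREIGN KEY clauses. The relationship is
-- logical and is maintained by the ETL layer, as explained below.
```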

-Surrogate Key in Physical DBMS Layer

While the surrogate key is used to establish a one-to-many relationship between the fact and dimension tables, where it acts as the primary-key/foreign-key identifier in the dimension and fact table respectively, it must be kept in mind that this is a logical relationship, and it is not meant to be implemented in the database layer using the DBMS's own PRIMARY KEY mechanism. As a rule of thumb we do not implement any keys or constraints in the physical layer; they are simply not required in data warehousing systems.

In DW systems, and in dimensional models in particular, physically enforced keys are not relevant. They act as a burden on the ETL load, are a cause of load failures, and at times bring the data warehouse to its knees. As Ralph Kimball says (and anyone with common sense can deduce), keys and constraints enforced at the physical level add no value in a DW, because there they are process-oriented objects, unlike in operational systems, where they are entity-identifier objects.

So the first point is that we don't need to enforce a primary key via the database in an OLAP environment: it is a controlled environment, loaded only through ETL, and so is not prone to the random updates of an OLTP environment. The logic to generate and maintain the surrogate keys is already part of the ETL layer.
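
For illustration, here is a minimal sketch of that key-assignment logic, assuming a hypothetical staging table STG_PRODUCT and the DIM_PRODUCT table from above (many teams use a database sequence instead of the MAX()+1 shown here):

```sql
-- Assign the next surrogate keys to brand-new products arriving in staging.
-- COALESCE handles the very first load, when the dimension is still empty.
INSERT INTO DIM_PRODUCT (PRODUCT_KEY, PRODUCT_CODE, PRODUCT_NAME, CATEGORY)
SELECT (SELECT COALESCE(MAX(PRODUCT_KEY), 0) FROM DIM_PRODUCT)
         + ROW_NUMBER() OVER (ORDER BY s.PRODUCT_CODE),
       s.PRODUCT_CODE,
       s.PRODUCT_NAME,
       s.CATEGORY
FROM   STG_PRODUCT s
WHERE  NOT EXISTS (SELECT 1
                   FROM   DIM_PRODUCT d
                   WHERE  d.PRODUCT_CODE = s.PRODUCT_CODE);
```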

Apart from this, surrogate keys enforced as primary keys at the database layer come with several serious drawbacks:

– Surrogate keys are maintained using SCD (slowly changing dimension) logic, which determines whether a dimension row's key needs to be updated in place, assigned a new value, or left as the old value. This depends on whether the column is SCD Type 1, 2 or 3, which in turn is driven by business requirements. In such a scenario an auto-generated primary key implemented at the database level for a surrogate key can be a serious troublemaker: it can throw the whole SCD logic out of the window and give unexpected results (see the SCD Type 2 sketch after this list).

– Further, a surrogate key performs the function of a primary key only logically, in the given context of a data mart. This means that if we re-use the same dimension as a shrunken dimension in another data mart where the grain is different, it will not be unique at that level.

– Apart from this, any database-level enforcement adds a burden to the ETL load by reducing its performance. More database-level keys mean more I/O, more CPU, and less time for the ETL load to do its work. This can become a nightmare if your data warehouse contains a sizable chunk of data, and it is a common cause of load failures as well.
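
To illustrate the SCD point above, here is a minimal SCD Type 2 sketch, assuming the hypothetical DIM_PRODUCT also carries housekeeping columns ROW_EFFECTIVE_DATE, ROW_EXPIRY_DATE and CURRENT_FLAG. The ETL itself expires the current row and mints a new surrogate key for the replacement row; a database auto-generated key plays no part in this decision:

```sql
-- Step 1: expire the current row for products whose CATEGORY has changed.
UPDATE DIM_PRODUCT
SET    ROW_EXPIRY_DATE = CURRENT_DATE,
       CURRENT_FLAG    = 'N'
WHERE  CURRENT_FLAG = 'Y'
AND    EXISTS (SELECT 1
               FROM   STG_PRODUCT s
               WHERE  s.PRODUCT_CODE = DIM_PRODUCT.PRODUCT_CODE
               AND    s.CATEGORY    <> DIM_PRODUCT.CATEGORY);

-- Step 2: insert a fresh row, with a NEW surrogate key assigned by the
-- same ETL logic shown earlier, for every staging product that now has
-- no current row (covers both changed and brand-new products).
INSERT INTO DIM_PRODUCT (PRODUCT_KEY, PRODUCT_CODE, PRODUCT_NAME,
                         CATEGORY, ROW_EFFECTIVE_DATE, CURRENT_FLAG)
SELECT (SELECT COALESCE(MAX(PRODUCT_KEY), 0) FROM DIM_PRODUCT)
         + ROW_NUMBER() OVER (ORDER BY s.PRODUCT_CODE),
       s.PRODUCT_CODE, s.PRODUCT_NAME, s.CATEGORY,
       CURRENT_DATE, 'Y'
FROM   STG_PRODUCT s
WHERE  NOT EXISTS (SELECT 1
                   FROM   DIM_PRODUCT d
                   WHERE  d.PRODUCT_CODE = s.PRODUCT_CODE
                   AND    d.CURRENT_FLAG = 'Y');
```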

-Foreign Key Constraints (RI) in DBMS Layer

While surrogate keys logically act as foreign keys in the fact table, these foreign key constraints (referential integrity, or RI) are also not required to be implemented in the DBMS layer. As Ralph Kimball says, this too proves to be more of a problem than a value addition.

To explain this further, here are a few more points:
– Data is loaded into the fact table only after it has been verified, cleaned, and loaded into the dimension tables, so implementing RI at the fact table level serves no purpose.
– A fact table is a voluminous table with millions of rows, and implementing RI makes it prone to ETL load failures, since FK enforcement incurs a huge cost in terms of CPU and I/O load on the server.
– Some people choose to declare an FK constraint on the fact table but NOT enforce it. This is normally done on DB2 systems, which are supposed to use this information for the aggregate navigator in the form of MQTs (materialized query tables); a sketch follows the link. Refer to the link below for more details:

http://publib.boulder.ibm.com/infocenter/dzichelp/v2r2/topic/com.ibm.db2z9.doc.apsg/src/tpc/db2z_inforeferentialconstraints.htm
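
As a rough illustration, such a declared-but-not-enforced (informational) constraint looks something like this in DB2; the table and constraint names are the hypothetical ones used earlier:

```sql
-- DB2 informational constraint: declared so the optimizer (and MQT
-- matching) can use it, but never checked during the ETL load.
-- Assumes PRODUCT_KEY has been declared as a primary/unique key on
-- DIM_PRODUCT, which DB2 requires of the parent table.
ALTER TABLE FACT_SALES
  ADD CONSTRAINT FK_SALES_PRODUCT
      FOREIGN KEY (PRODUCT_KEY)
      REFERENCES DIM_PRODUCT (PRODUCT_KEY)
      NOT ENFORCED
      ENABLE QUERY OPTIMIZATION;
```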

This aspect is normally misleading, because MQTs are not used automatically. Reports are normally run through a BI tool, and whichever MQTs we create are used through the BI tool, so the BI tool's query rewrite handles the routing. In fact, in almost all cases the MQTs we build are built for a specific report, and so the MQT is queried directly in all those cases. Further, since an MQT holds its own data once created, querying it is isolated from the source tables. With MQTs implemented through the BI tool in this way, the aggregate navigator is not used at all.

Apart from this, DB2's aggregate-navigator optimization is not known to give good results, and we can always get better results by routing queries to the aggregates explicitly. MQTs on top of this only make matters worse, and in our observation MQT refresh is the single biggest reason for ETL failures.

-Political Challenge from the DBA Team

While this approach is correct in theory, the biggest challenge in going forward with it is not technical but political: resistance from the DBA team, something Ralph Kimball has also hinted at. A DBA team that is not aware of data warehousing systems can be a nightmare to work with in such cases. A few points can be helpful in explaining to them the superfluous nature of implementing keys & constraints at the DBMS level:

– Implementing keys & constraints at the physical layer duplicates the effort of the ETL team; this part is already taken care of in the ETL code.
– A good ETL process means no garbage or random updates reach the data warehouse, so enforcing such things at the physical level amounts to saying our ETL jobs are poor.
– Run a demo of the ETL load with keys & constraints and without them, showing why they are performance killers (a sketch follows this list).
– Finally, try explaining that OLAP and OLTP systems are poles apart, and that things work quite differently in the OLAP world.
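
A rough sketch of such a demo, assuming a hypothetical staging table STG_SALES and DB2-style CREATE TABLE ... LIKE: load the same data into two copies of the fact table, one with enforced RI and one without, and compare the timings.

```sql
-- Copy 1: fact table with an enforced FK constraint.
CREATE TABLE FACT_SALES_FK LIKE FACT_SALES;
ALTER TABLE FACT_SALES_FK
  ADD CONSTRAINT FK_DEMO FOREIGN KEY (PRODUCT_KEY)
  REFERENCES DIM_PRODUCT (PRODUCT_KEY);        -- enforced RI

-- Copy 2: plain fact table, no keys or constraints.
CREATE TABLE FACT_SALES_PLAIN LIKE FACT_SALES;

-- Load identical data into both and compare elapsed time, CPU and I/O;
-- the constrained load pays a per-row lookup against DIM_PRODUCT.
INSERT INTO FACT_SALES_FK    SELECT * FROM STG_SALES;
INSERT INTO FACT_SALES_PLAIN SELECT * FROM STG_SALES;
```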
