ETL Tip- How to Handle Null vs the Blank fields in Data Warehouse

While populating the data warehouse ETL team is always confused as to how to handle the “Logically NA” values of data. Here are some suggestions in that regards,

-What does “Logical Nulls” means?

There are always some cases in the data model when there is no data available for a field in some cases. Such “Data Not Available” cases which are in consistent with underlying business rules are called as logical nulls.

A typical example can be say for an HR data warehouse is fields like Education Qualification or marital status or such fields. There would be some employees for whom data is simply not available for these fields and so they are logical nulls.

Another example is derived flags in data model. We always have some flags in dimension table to aid the query performance at reports. For example a flag for PHD people which can be derived from the educational qualification. Now if the underlying educational qualification is logically null then such a derived field would also be “logically null”

Another typical case found in data warehousing is that of out-trigger dimensions or snow-flaked dimensions which can also be logically null in parent dimension if that value is not present .

– Null vs Blank vs Zero 

In any data warehouse it is very important to keep the consistency . So while a “logical null” can implemented at DBMS level /ETL layer by either a null or blank or zero , we must be sure that in any model , we adopt one thing and keep it same across all the dimensions . Nothing kills a good data model than a data inconsistency.

Now some people have issues in allowing physical nulls to data warehouse and their concerns typically are

– It may have unknown & unforeseen impact on the BI Reports, some rows may get dropped.
– Nulls in data looks bad and seems like a garbage data
– A default value assigned is a better option
– We can handle it case by case basis & no need to think over it.

– How to Handle Logical Nulls ?

While there may not clear cut best way to go about it , we surely know that worst mistake to do on this is to handle it vase by case basis . As already said it interstices an element of surprise and breaks consistency of data and every modeler must avoid it at all costs.

Now let us look at impact of assigning a default value for logical nulls. If we were to take the previous example , then for a PHD_Flag , we can have normal values as 1 or 0 depending if that person has a PHD qualification or not. For the people whose data we don’t have and are “logically null” , it means we would count them on either side . So if we assign default of 0 to this col it means you would end up counting those people into “Non PHD holders” as well whose data we dont have. This is data in-accuracy.

Instead we leave the null as it is in DBMS , we can maintain better data accuracy. We can count all three type people as below

Phd Holder = Count of “PHD_FLAG=1”
Non Phd Holder = Count of “PHD_FLAG=0”
No data Phd Holder = Count of “PHD_FLAG ISNULL”

The concerns that allowing nulls in DB has unknown impact on BI reports is normally from those who have never worked with BI reports. All the BI vendors like cognos , BO has a very good support for handling nulls , you can choose to suppress null values or display at reports and “never” does it ever drops such rows.

– Another solution deployed in these cases is that to replace such nulls with a blank value. A blank value is normally an empty string . While this option is doable it is not very good for following reasons
– This may not be applicable to all the cols as not col will be of char or varchar type
– Querying an empty string is more difficult than querying Nulls. All BI vendors and DBMS vendors has a good support for Null fields querying but not so good blank fields
– Blanks are difficult to identify & control on BI reports side as it would expect a custom function to be written each time when its cols used for reports

So in a nutshell the best option to handle such cases is normally to implement it as Null itself in DBMS . This must be noted this is not applicable to the fields which are marked as surrogate keys since surrogate keys by their very definition are not null-able objects.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s