Saturday, March 1, 2014

Slowly Changing Dimensions (SCD) - Data Warehouse

Dimension is a term in data management and data warehousing. It's the logical groupings of data such as geographical location, customer or product information. With Slowly Changing Dimensions (SCDs) data changes slowly, rather than changing on a time-based, regular schedule.

In Data Warehouse there is a need to track changes in dimension attributes in order to report historical data. In other words, implementing one of the SCD types should enable users assigning proper dimension's attribute value for given date. Example of such dimensions could be: customer, geography, employee.

There are many approaches how to deal with SCD. The most popular are:
  • Type 0 - The passive method
  • Type 1 - Overwriting the old value
  • Type 2 - Creating a new additional record
  • Type 3 - Adding a new column
  • Type 4 - Using historical table
  • Type 6 - Combine approaches of types 1,2,3 (1+2+3=6)

Type 0 - The passive method. In this method no special action is performed upon dimensional changes. Some dimension data can remain the same as it was first time inserted, others may be overwritten.

Type 1 - This methodology overwrites old with new data, and therefore does not track historical data.
Example of a supplier table:

In the above example, Supplier_Code is the natural key and Supplier_Key is a surrogate key. Technically, the surrogate key is not necessary, since the row will be unique by the natural key (Supplier_Code). However, to optimize performance on joins use integer rather than character keys.

In relational model database design, a natural key is a key that is formed of attributes that already exist in the real world. For example, a USA citizen's social security number could be used as a natural key. In other words, a natural key is a candidate key that has a logical relationship to the attributes within that row. A natural key is sometimes called a domain key.
If the supplier relocates the headquarters to Illinois the record would be overwritten:

The disadvantage of the Type I method is that there is no history in the data warehouse. It has the advantage however that it's easy to maintain.

If you have calculated an aggregate table summarizing facts by state, it will need to be recalculated when the Supplier_State is changed.

Type 2 - This method tracks historical data by creating multiple records for a given natural key in the dimensional tables with separate surrogate keys and/or different version numbers. Unlimited history is preserved for each insert.
For example, if the supplier relocates to Illinois the version numbers will be incremented sequentially:

The null End_Date in row two indicates the current tuple version. In some cases, a standardized surrogate high date (e.g. 9999-12-31) may be used as an end date, so that the field can be included in an index, and so that null-value substitution is not required when querying.

Transactions that reference a particular surrogate key (Supplier_Key) are then permanently bound to the time slices defined by that row of the slowly changing dimension table. An aggregate table summarizing facts by state continues to reflect the historical state, i.e. the state the supplier was in at the time of the transaction; no update is needed.

Type 3 - Adding a new column. In this type usually only the current and previous value of dimension is kept in the database. The new value is loaded into 'current/new' column and the old one into 'old/previous' column. Generally speaking the history is limited to the number of column created for storing historical data. The original table structure in Type 1 and Type 2 is the same but Type III adds additional columns. In the following example, an additional column has been added to the table to record the supplier's original state - only the previous history is stored.

This record contains a column for the original state and current state—cannot track the changes if the supplier relocates a second time.

One variation of this is to create the field Previous_Supplier_State instead of Original_Supplier_State which would track only the most recent historical change.

Type 4 - The method is usually referred to as using "history tables", where one table keeps the current data, and an additional table is used to keep a record of some or all changes. Both the surrogate keys are referenced in the Fact table to enhance query performance.

For the above example the original table name is Supplier and the history table is Supplier_History.

Type 6 - Combine approaches of types 1,2,3 (1+2+3=6).

The Supplier table starts out with one record for our example supplier:

The Current_State and the Historical_State are the same. The Current_Flag attribute indicates that this is the current or most recent record for this supplier.

When Acme Supply Company moves to Illinois, we add a new record, as in Type 2 processing:

We overwrite the Current_State information in the first record (Supplier_Key = 123) with the new information, as in Type 1 processing. We create a new record to track the changes, as in Type 2 processing. And we store the history in a second State column (Historical_State), which incorporates Type 3 processing.

For example if the supplier were to relocate again, we would add another record to the Supplier dimension, and we would overwrite the contents of the Current_State column:

Note that, for the current record (Current_Flag = 'Y'), the Current_State and the Historical_State are always the same.

No comments:

Post a Comment

Share Your Inspiration...