Automate Office with Python: Slowly Changing Dimensions

Posts so far:

A Data Warehouse is a central repository of data that should be acting as the single source of truth for Business intelligence reporting throughout an organisation. In some places the Enterprise Data Warehouse is often the only place that some data sources have been integrated together and might be the easiest place for management to obtain the information they need to drive the business.

One of the key choices the Warehouse and Business teams need to make is how are they going to cope with changing dimensional data. If Betty has an address of 28 Street X for orders before 201910 and then moves to 5 Street Y, you would want the orders in the relevant time frames associated with the correct address at the time, well probably. but what about date of birth? If this had been changed from 1972 to 1992 then this would make a difference to metrics where you have been aggregating by age ranges, would you want your facts to be summed by incorrect data just because that is how it was at the time or would you want it reflecting the new and most accurate data?

There are different "types" of ways to cater for Slowly Changing Dimensions, SCD0, SCD1, SCD2 etc, they have different benefits and personally I have found using SCD2 processing with business modelling generating a latest view of the dimension a very flexible way to cater for this. Firstly lets discuss the different types and then I will explain the the use of SCD2 for the above example.

Types of SCD processing:

Type 0 – No change or fixed Dimension

Dimension never changes, so Betty will always have been born in 1972 and live at the first address she reported.

Type 1 – Latest Only Dimension

Historical Data is removed or update to reflect the latest, so Betty is always born in 1992 and lives at the last submitted address.

Type 2 – Time framed dimension

Track changes with to and from effective dates and often using current flag or latest flags with the active dates.

There are other types that are frankly not worth going into. You can read more about them here.

Why I rely on SCD2 processing:

Now I said above that I currently favour relying on SCD2 processing. This type has the advantage of allowing you to join on a surrogate key or time period (depending on what processing you are using) or pick the current version which allows you to get the Business Intelligence team greater flexibility when generating their reports and dash-boards, though it does come with risks.

Source Data in the Data Lake from source System

As the data has come in we have 3 different rows in the slowly changing dimension "data lake" table. For simplicity I have used changes over a small time frame but the from and to dates could be years apart. Betty was a new customer in 201809, changed her address in 201811 and then her date of birth in 201901, note the last entry is shown as current and as such has a null to date.

Latest Customer Dimension

From the source data we can create a dimension of the latest data, obviously this can change each month to reflect the latest data available and as such reporting would update accordingly.

Customer Dimension by period

There are several techniques you can use to effectively generate the above, obviously you can use surrogate keys in the SCD2 table and use these in the fact table in place of the customer id. Equally an option shown here is to store the customer data as at every year month, if the table is partitioned on year month the joins can be incredibly efficient and performant reporting can take place.

Simple Orders Fact

Some sample Fact Data for some order by Betty, linking to the customer dimension on the customer id and either the year month or the current flag.

Using the above data it is then relatively simple to produce a report like the above. the problem with this report is that we know the age is incorrect in the first 2 entries, so lets switch and use the latest dimension.

Now the issue becomes that we know that not all orders were placed at 5 Street Y, so instead we have to make sure our reporting team, through their business knowledge and use of the data dictionary, use the right fields for the right dimension. Note that in the modelling tool the age here would be called latest provided age or something like that.

So here we have our correct report, all using SCD2 processing but modelling it in 2 different ways, once using the by period data and once using the latest.

Automate Office with Python

Pages

Tuesday, 1 September 2020

Slowly Changing Dimensions

Types of SCD processing:

Why I rely on SCD2 processing:

No comments:

Post a Comment