Sunday 30 August 2020

Creating a Home Data Warehouse from a Random Dataset

I think this is a great project for budding ETL developers. It looks good if you are able to install and configure a database on your home PC, and if you can then implement a Data Warehouse design and the ETL for it on that database, you are well placed for many interviews.

On that note....

For an interview recently I was given a file of data and asked to create a data model for it, just the design. I decided to go a step further and actually load the data into a basic, mini star schema. Honestly, I struggle to find data for a home warehouse. It is something I have wanted to do for a while, and I have now decided to give it a good crack.

What I built for the interview was clearly an ad-hoc, one-off process. For this new Data Warehouse my biggest issue is going to be that I don’t currently have a free front-end reporting tool with which to create reports. I have some ideas that I might play with, but for now I will work to a generic (and very bad) spec of:

“Create the most flexible star schema from the following dataset, taking into account that future datasets may need to be added”

So yeah, broad.

The first task was to come up with a dataset to analyse, as I didn’t want to use the original one in case the same example is used for future candidates and they somehow find this series of posts.

I was likely to just grab something from Kaggle, but I wanted to make sure I found a dataset that interested me and that had other datasets which could clearly be integrated with it as well. Some ideas I have:

1) Amazon Movies / TV shows

2) Covid Data

3) Cricket data of some kind

4) Running Data

Although the data was quite small and similar to something else I have analysed, I came across a nice easy dataset for the first idea and decided to run with it to start with: https://www.kaggle.com/nilimajauhari/amazon-prime-tv-shows

DataSet

Once I had downloaded the data I did a bit of basic exploration of it. As it was so small and a CSV, I opened it in Notepad++, followed by some spreadsheet software.

On the face of it we have a very simple dataset. There are some quirks, especially the comma-separated list of categories sitting inside a single field of a CSV file.
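To give an idea of how that category quirk might be handled in a star schema, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the sample rows and the column names (title, year, categories) are made up and are not the real headers from the Kaggle file, and build_tables is a hypothetical helper name. The idea is to split the multi-valued field into a small category dimension plus a bridge table linking shows to categories.

```python
import csv
import io

# Made-up sample rows. The column names are assumptions for illustration,
# not the actual headers of the Kaggle dataset.
SAMPLE_CSV = '''title,year,categories
Show A,2019,"Drama, Comedy"
Show B,2020,Drama
Show C,2018,"Action, Drama, Kids"
'''


def build_tables(csv_text):
    """Split a multi-valued categories column into a dimension and a bridge table."""
    dim_show = []       # rows of (show_key, title, year)
    dim_category = {}   # category name -> surrogate key
    bridge = []         # rows of (show_key, category_key)

    reader = csv.DictReader(io.StringIO(csv_text))
    for show_key, row in enumerate(reader, start=1):
        dim_show.append((show_key, row["title"], int(row["year"])))
        # The csv module has already handled the quoted field, so a plain
        # split on commas is enough to separate the individual categories.
        for raw in row["categories"].split(","):
            name = raw.strip()
            if not name:
                continue
            # Reuse the existing key, or hand out the next one for a new category.
            category_key = dim_category.setdefault(name, len(dim_category) + 1)
            bridge.append((show_key, category_key))
    return dim_show, dim_category, bridge


if __name__ == "__main__":
    shows, categories, links = build_tables(SAMPLE_CSV)
    print(shows)
    print(categories)
    print(links)
```

In a real load the three lists would become tables in the database, and new datasets could reuse the same category dimension, which fits the "future datasets may need to be added" part of the spec.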

For now this small dataset is going to be my starting point. Hopefully there will be much more to come on this, although honestly I don't know if I will keep up this project or remember to do the blog, as I am very busy.
