Sunday, 10 April 2022

Free Data Solution - Overview

Below is an overview of the data solution I came up with. I would say data warehouse but this is not a data warehouse. So my solution relies on the data source of google sheets populated by Google form but if you have a better data source that you can access freely then great. 

This file is imported on a schedule by the Keboola Engine. The Flow I have created loads the data source in. As I am just starting out with all these systems so it is doing a full load. The file is then stored in the bucket on Keboola and pushed into a MySQL database I have created (again for free in the cloud). I then have a view sitting over the table although that logic could easily be in the next step of the snowflake transformation. The Snowflake transformation takes place on a Snowflake database where you can run a series of SQL statements and processes to trans form your data. The output from this is then pushed back into the MySQL database. I have then connected Google Data Studio to the mySQL database to act as the visualisation software. 


You can check out some of the output here. 



Saturday, 9 April 2022

Free Data Warehouse - Data Source

So I decided to embark on trying to build my own free data platform. If nothing else this could be a little project for someone looking to get into Data Engineering and Warehousing to have a go at. 

Honestly I have not yet decided on the subject area I am going to use but am looking into the tooling and processes. To give me a basic starting point on testing out the tooling etc. I am going to use a really basic Google Form and Spreadsheet as my data sources. I am an avid runner but do not do enough strength training so I set up a form to track what I am doing to help motivate me. 

The form can be seen in the screen shot below, though there are more fields for other types of exercise.



This form feeds straight into a spreadsheet which then acts as the data source for the ETL. Now this data source could be a transactional database, a file in an S3 bucket or whatever source you want. For ease of use downstream I have set all the column headers to be in capitals, this seemed to be beneficial downstream when working with Snowflake which didn't play nice with case changes.  

Saturday, 4 September 2021

Automation Stuff

Most of my automation work recently has been creating a standard template in Excel that saves on time and human error when producing meta data into the standard ETL engine for our enterprise data warehouse. 

On this project I have, mainly through stackoverflow, developed a couple of macros that have made my life much easier and some crazy formulas to contain the logic to determine the correct template to use. The most important thing is that the end result is a template that can be used by 20+ people in the team and regularly save them time and provide consistency in the scripts provided to the deployment team. 

As much as I have really enjoyed getting my eye back into VBA and on a project that has had lots of positive feedback I still think that there are better ways forwards in general. I have been looking into python again and wanted to see what libraries are out there since I last used it in anger. Honestly most of the recommended ones have not changed however it was still a good reminder as to what is out there, even if most are not suitable for any use case I have. 

I found the following article very interesting in terms of looking at what is available for performing ETL functions within python. 


https://www.xplenty.com/blog/comparison-of-the-top-python-etl-tools/