Automate Office with Python: Design considerations

I have been absent over the last few months for several different reason. Primarily because I knew there was a chance of a promotion at work. Where I work the jobs are advertised externally and internally to ensure the best person is found for the job. Given current COVID issues there are a lot of people looking for a job so there were over 30 high quality candidates and 7 reached an interview, including myself.

The role in question is a Lead Data Warehouse and BI Developer. Currently I am a Senior ETL Developer, the lead role works as a lead for a Data Feed and Subject Area within our Enterprise Data Warehouse. Happy to say that I was offered the role, once of the questions I asked to address and answer in the presentation stage was Top 5 Design Considerations when introducing a new data source into an Enterprise Data Warehouse.

Any questions or comments then feel free to post, please don't try and nick this presentation and my notes as although I have juggled things around for this post it is still my work and it took a lot of thought in the first place and to get it into this post.

Slide 2:

I thought the best way to address this question would be to start with my simplified view of what an Enterprise Data Warehouse is. The overview shown presents a simplistic view of multiple source systems, for example, transactional data or files such as from the ONS, finance systems etc., flowing into a landing or “data lake” area. From here an ETL engine will perform various transformations that will result in the production of data in a series of dimension and fact tables in star schemas. These dimension and fact tables are then presented to end users through a business intelligence tool, for example Oracle Analytics or ClickView.

So my first design consideration would be…

Slide 3:

Business or user reporting requirements and SLA’s. For a successful design we need to consider how are users going to access the data, what Facts and Dimensions do they need but also importantly can your reporting system cope with the demands. If you are unable to develop the dashboards or pixel perfect reports they want or they will take too long with the current set up then this needs to be highlighted early.

Slide 4:

Data profiling – This is the systematic analysis of the content of a data source, looking at the size of all the data and checking constraints and keys and whether the data can meet the high level goals of the data warehouse, such as is the data of high enough quality that we can actually use it?

An entity relationship diagram would be useful help identify relationships and other design considerations in the data profiling include: are you getting full extracts or delta’s? How does the data change?

Lastly, we also need to think about how this data is going to get into the Warehouse and whether you have the resources and scheduling available to handle the processing and volume of data in question with the other data sources that already exist.

Slide 5:

Can we integrate the data into our existing ecosystem. Some design issues that come to mind are:

Do we have an ETL tool that can handle the required transformations
Do we have the DB resources to perform the transformations in a performant way, at the desired frequency and without interrupting the existing processes.

Then my 4^th design factor to consider is integration of the modelling:

Can it be integrated into the existing repository e.g. can a tool such as the RPD work with integrating this model.

What Dimensions and Facts are needed within this model? What star schema’s are required? This relates back to point 1 on business requirements but also what conformed dimensions are already available within existing processes in the data warehouse and how will this fit together. And what hierarchies are required.

The previous points can all be linked together through the production of a conformance or bus matrix to visually represent the planned fact and dimensions along with the conformity.

Slide 6:

My final design consideration is security and data governance.

When on-boarding a new dataset we need to consider, especially with highly confidential patient data, do we have robust enough security in place for the data transfers, data storage and data presentation.

You also need to consider which data within the dataset you require, with GDPR it is increasingly important that you only have and retain the data that you actually need. Patient names for example may need to be pseudonymised or even anonymised. If any new sensitive data is involved you may need to design the requirements around the guidelines of the information governance team.

Automate Office with Python

Pages

Wednesday 26 August 2020

Design considerations - Enterprise data warehouse

No comments:

Post a Comment