Wednesday, 9 June 2021

Data Profiling in Python

Pandas Profiling

I was looking into which new Python libraries would be worth spending some time learning and could be useful in my role as a Data Warehouse Developer, and I came across Pandas Profiling. I am hoping to connect it to my database to see whether it is as useful as it sounds.

The basic idea is that you create a dataframe from your data (it could come from a CSV file or a database table), then run a single line of code, ProfileReport from pandas_profiling, and it performs exploratory data analysis for you. What does this mean?

Well, it means that it can produce some analytics on your dataset with very little code and generate a report. This sounds great for a first look at a new dataset. The documentation has some examples, such as this one:

https://pandas-profiling.github.io/pandas-profiling/examples/master/census/census_report.html

You can see the dataset and the report. It shows things like duplicate rows and null values. For numerical columns it gives you the min, max, percentage of zeros, etc., and for category-style columns you get a list of the top values. Within each column you can drill into the statistics for further detail: the median, percentiles, histograms and common values.
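
To make those statistics a bit more concrete, here is a rough hand-rolled sketch of the kind of numbers the report shows. This is only an illustration using the standard library, not how Pandas Profiling computes them, and the function names and sample values are made up for the example.

```python
import statistics
from collections import Counter


def summarise_numeric(values):
    """Basic stats for one numeric column; None stands in for a null.

    Assumes the column has at least one non-null value.
    """
    present = [v for v in values if v is not None]
    zeros = sum(1 for v in present if v == 0)
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "median": statistics.median(present),
        "pct_zeros": 100 * zeros / len(values),
    }


def count_duplicate_rows(rows):
    """Count rows that repeat an earlier row exactly."""
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(row)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates


def top_values(values, n=3):
    """Most common non-null values for a category-style column."""
    return Counter(v for v in values if v is not None).most_common(n)


# Made-up sample data, purely for illustration.
ages = [34, 0, 51, None, 0, 28, 45, 19]
print(summarise_numeric(ages))
print(count_duplicate_rows([("a", 1), ("b", 2), ("a", 1), ("a", 1)]))
print(top_values(["x", "y", "x", None, "z", "x", "y"], n=2))
```

The real report does all of this (and a lot more) for every column at once, which is the appeal.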

There is also a warnings tab that gives you some idea of what data might be missing, duplicated, etc.
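
For reference, the workflow described above (load a dataframe, then hand it to ProfileReport) can be sketched roughly as follows. I haven't run this against my own database yet, so treat it as a minimal sketch: the helper name, file names and report title are placeholders I made up, and it assumes pandas and pandas_profiling are installed.

```python
def write_profile_report(csv_path, html_path, title="Profile report"):
    """Load a CSV into a dataframe and write an HTML profile report."""
    # Imported inside the function so this sketch can be loaded even on a
    # machine where the third-party packages are not installed yet.
    import pandas as pd
    from pandas_profiling import ProfileReport

    df = pd.read_csv(csv_path)

    # The "single line" that runs the exploratory analysis.
    profile = ProfileReport(df, title=title)

    # Save the report as a standalone HTML file.
    profile.to_file(html_path)
    return html_path


# Hypothetical usage; "census.csv" is a placeholder file name:
# write_profile_report("census.csv", "census_report.html", title="Census")
```

For a database table, the same idea should apply with the dataframe built from a query (for example with pandas' read_sql) instead of read_csv.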

D-Tale

This package is like Pandas Profiling but allows you to visualise the data (again from a dataframe) in the form of a pivot chart. I haven’t really had time to investigate this yet but, much like the one above, it sounds useful.
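
As far as I can tell from the D-Tale documentation, getting a dataframe into it looks something like the sketch below. I haven't tried it yet, so this is an untested starting point; the helper name is my own, and it assumes the dtale package is installed and that you are working in a notebook or interactive session.

```python
def open_in_dtale(df):
    """Start a D-Tale session for a dataframe and open it in the browser."""
    # Imported inside the function so this sketch can be loaded even where
    # dtale is not installed yet.
    import dtale

    # show() starts a local D-Tale session for the dataframe.
    instance = dtale.show(df)

    # Open the interactive view in the default web browser.
    instance.open_browser()
    return instance
```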
