Monday 18 May 2020

Fuzzy Matching in Python

Two of the things I am interested in for my current project are pattern matching / record linkage, and data cleansing and validation, and how best to achieve them in Python. I have used built-in tools and done some wrangling of my own in SQL, in both Oracle and SQL Server, but it will be interesting to see how Python handles it and how easily I pick it up.

I did some digging into what tools were available and came across FuzzyWuzzy and other things that were clearly fuzzy matching, but my favourite so far has been the Python Record Linkage Toolkit.

Honestly, at the moment it is a struggle to remember much Python. It is certainly coming back, but there are a lot of the basics I am having to check on Stack Overflow just to remind myself of them! Too much ETL in Oracle recently and not enough diversity.

The data: 

When I am a bit further into this I will look into some proper data, but at the moment I am just prototyping the things I find. The first thing I am going to do is some matching on names. In every data job I have had, there are forms with free-form entry and there are always typing errors, so being able to match names based on certain thresholds or criteria is really useful.
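To make the idea of a threshold concrete, here is a minimal sketch using only `difflib` from the standard library rather than any of the packages mentioned above. The function names, the sample names and the 0.85 threshold are all assumptions picked for illustration, not values from my project.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Return a score between 0.0 and 1.0, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def is_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two names as the same when the score reaches the threshold.

    The 0.85 default is an arbitrary starting point, not a recommended value.
    """
    return name_similarity(a, b) >= threshold


if __name__ == "__main__":
    # Made-up pairs: one likely typo, one clearly different name.
    pairs = [("Jonathan Smith", "Jonathon Smith"), ("Jonathan Smith", "Mary Jones")]
    for left, right in pairs:
        print(left, "|", right, "|", round(name_similarity(left, right), 2), is_match(left, right))
```

Raising the threshold gives fewer false matches but misses more typos, so in practice it would need tuning against real data.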

For now I am going to work with 2 very simple CSV documents and see what sort of matching I can find in Python. Personally, I would still likely want to import this data into a database and kick off a package in there to do the matching. It will be interesting to see, after exploring all the matching options, whether I get a better set of results from Oracle or Python, and how easy I find it.
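As a rough sketch of what matching across 2 CSV documents could look like with nothing but the standard library, the example below reads a `name` column from each and pairs every name on the left with its closest name on the right. The CSV contents, the column names and the 0.8 cutoff are hypothetical stand-ins, not my actual files.

```python
import csv
import io
from difflib import get_close_matches
from typing import Dict, List, Optional

# Hypothetical stand-ins for the two CSV files; real code would open the files instead.
CSV_A = "id,name\n1,Jonathan Smith\n2,Mary Jones\n3,Peter Brown\n"
CSV_B = "id,name\n10,Jonathon Smith\n11,Marie Jones\n12,Alice Green\n"


def read_names(text: str) -> List[str]:
    """Read the 'name' column from CSV text."""
    return [row["name"] for row in csv.DictReader(io.StringIO(text))]


def best_matches(left: List[str], right: List[str], cutoff: float = 0.8) -> Dict[str, Optional[str]]:
    """Map each left-hand name to its closest right-hand name, or None if nothing clears the cutoff."""
    result: Dict[str, Optional[str]] = {}
    for name in left:
        candidates = get_close_matches(name, right, n=1, cutoff=cutoff)
        result[name] = candidates[0] if candidates else None
    return result


if __name__ == "__main__":
    for name, match in best_matches(read_names(CSV_A), read_names(CSV_B)).items():
        print(name, "->", match)
```

This compares every name against every other name, which is fine for tiny files but would get slow on large ones.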


The left shows some of the code that I will embed below, and the right shows the output, where the first 2 prints show the contents of the 2 CSV documents.

This seems simple enough and the matches make sense. Looking at the documentation, you can apply different algorithms to the matches etc., but something about it didn't sit right with me.
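To show why the choice of algorithm matters, here is a small sketch that scores the same pair of strings two ways: a hand-written Levenshtein edit distance scaled to 0 to 1, and `difflib`'s matching-blocks ratio. This is plain Python written for illustration and is not the toolkit's own API; the function names are my own.

```python
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character inserts, deletes or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Scale edit distance into a 0.0 to 1.0 score, where 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def ratio_similarity(a: str, b: str) -> float:
    """Score from difflib's matching-blocks approach."""
    return SequenceMatcher(None, a, b).ratio()


if __name__ == "__main__":
    # Made-up examples: a substituted letter and two swapped letters.
    for left, right in [("smith", "smyth"), ("smith", "simth")]:
        print(left, right, levenshtein_similarity(left, right), ratio_similarity(left, right))
```

Swapped letters count as two edits under Levenshtein, so the two measures can disagree on the same pair, and a threshold tuned for one will not carry over to the other.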

[Embedded code not preserved]

So next I came across something that looked like it would work more or less directly on a pandas DataFrame.

This is now my new favourite and is probably what I am going to use, at least as a basic matching system. I would love to see if I could create a machine learning process for this, and I also want to write some sort of data validation process, but that requires me to look up a more complicated dataset.



