I am always on the lookout for interesting tools to use. While researching for my current project I came across a library called FlashText. It claims to be 1000x faster than regular expressions on large datasets, which could be very useful as a clean-up step in any fuzzy matching process. I previously worked for a company whose clients were insurance companies, and fuzzy matching there could mean matching "ABC Insurance" against "ABC Ins Company". In that case you want to remove Insurance, Ins and Company from both names in order to create a meaningful match.
Just take a look at the match scores in this basic example using Jaro-Winkler matching in Oracle.
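To make the scoring problem concrete outside the database, here is a minimal pure-Python sketch of the Jaro-Winkler similarity. It is my own illustrative implementation, not Oracle's, so the exact numbers may differ from what Oracle reports; the function names `jaro` and `jaro_winkler` and the 0.1 prefix scale are assumptions chosen for the example.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, from 0.0 to 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters only count as matching if they are within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2

    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


# Score the raw names, then the names with the noise words already removed.
print(jaro_winkler("ABC Insurance", "ABC Ins Company"))
print(jaro_winkler("ABC", "ABC"))
```

The first score is dragged around by words that say nothing about which company it is, while the second compares only the part of the name that matters.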
An efficient way of removing keywords without the overhead of regular expressions is therefore what appealed to me when I saw this library. I then set about loading my CSV and replacing a string in it.
And here is the output from that (the top 3 are before and the bottom 3 are after). This could prove very useful in my attempt to create a generic, automated and useful fuzzy matching tool in Python.