Automate Office with Python: Extracting text from images

Monday, 12 February 2018

Extracting text from images

One of my current projects is looking at document storage and image processing. Just for fun, I am thinking of building a home document-sorting tool that works from scanned images. The process would be to scan an image, run a Python script that extracts the words, and then, based on those words, file the image in a location specified by a table in a database. I am in the very early stages of this, having only started looking at it today. Initially I thought of using a machine learning algorithm to recognise the logo in the picture, but this didn't seem practical given what I know so far about ML with pictures.
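To make the filing idea a bit more concrete, here is a minimal sketch of the lookup step, assuming the rules live in a SQLite table that maps a keyword to a folder. The table name `routing_rules`, the folder names, and the helper functions are all made up for illustration; nothing like this exists in the project yet.

```python
import sqlite3
from pathlib import Path


def build_rules_db(rules):
    """Create an in-memory table mapping a keyword to a destination folder.

    The table name and columns are hypothetical; a real setup would
    probably use a database file instead of ":memory:".
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE routing_rules (keyword TEXT PRIMARY KEY, folder TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO routing_rules VALUES (?, ?)", rules)
    conn.commit()
    return conn


def choose_folder(conn, extracted_text, default="unsorted"):
    """Return the folder of the first rule whose keyword appears in the text."""
    text = extracted_text.lower()
    rows = conn.execute("SELECT keyword, folder FROM routing_rules ORDER BY keyword")
    for keyword, folder in rows:
        if keyword.lower() in text:
            return folder
    # No rule matched, so fall back to a catch-all folder.
    return default


def destination_for(conn, image_path, extracted_text, root="sorted_scans"):
    """Build the path the scanned image would be filed under (nothing is moved)."""
    return Path(root) / choose_folder(conn, extracted_text) / Path(image_path).name
```

With a rule such as `("after school club", "school/clubs")` in the table, calling `destination_for(conn, "scan_001.jpg", text)` on the form text further down would point at the `school/clubs` folder. Matching on plain substrings is the simplest thing that could work; OCR mistakes will probably call for something more forgiving later.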

My first challenge for this project was to work out how to extract the text from an image. Knowing Python, I thought there would be something built in, but I ended up installing pytesseract (Google it; it was not much fun to install, and I will try to document the steps I took at some point, using a virtual env).

Of course, no software doing this is going to be perfect, and I really need to learn some of the Pillow library so that I can do some of the image processing myself. For now, I took a photo of a school form for my daughter. Running OCR on the whole photo was not successful, but cropping it down to just the text worked fantastically. So the key will be processing the images to make sure I am passing in the best possible input.
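The crop-then-read idea can be sketched roughly like this, assuming Pillow and pytesseract are installed and the Tesseract program itself is on the path. The function names, the fractional crop bounds, and the greyscale step are my own assumptions for illustration, not a tested recipe.

```python
def fraction_box(size, left, top, right, bottom):
    """Convert fractional bounds (0.0 to 1.0) into a pixel box for Image.crop.

    `size` is (width, height); the result is (left, upper, right, lower).
    """
    width, height = size
    if not (0.0 <= left < right <= 1.0 and 0.0 <= top < bottom <= 1.0):
        raise ValueError("need 0 <= left < right <= 1 and 0 <= top < bottom <= 1")
    return (
        round(left * width),
        round(top * height),
        round(right * width),
        round(bottom * height),
    )


def ocr_region(image_path, left=0.0, top=0.0, right=1.0, bottom=1.0):
    """Crop the image to a fractional region and return the text Tesseract reads."""
    # Imported here so fraction_box can be used without the OCR stack installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        region = image.crop(fraction_box(image.size, left, top, right, bottom))
        # Greyscale is an assumption: it often helps with phone photos,
        # but it has not been measured on these scans.
        return pytesseract.image_to_string(region.convert("L")).strip()
```

For example, `ocr_region("school_form.jpg", bottom=0.25)` would read only the top quarter of a (hypothetical) photo, which is where a form's title usually sits. Using fractions rather than pixel coordinates means the same crop can be reused on photos of different sizes.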

The image:

The output (pasted straight from the IPython console):

After School Club Booking Form - Individual Sessions - Term 4

The code:
