I have a Doxie Go scanner and I scan all the documents I receive on paper. That's nice, but it creates another problem: all the resulting PDF files have to be named, organized and stored... Doing that manually is boring and time consuming. Of course that's something I want to automate!
I even bought Hazel a while ago. It's a nice piece of software that monitors files in a folder and performs specific actions based on the rules you define. It works well, but I felt a bit limited and thought I could probably write something more tailored to my use case. And that would be more fun :-)
Parsing PDF in Python

A quick solution I found was to run pdftotext using subprocess. I looked at PDFMiner, a pure Python PDF parser, but I found pdftotext output to be more accurate. On macOS, you can install it using Homebrew:
$ brew install Caskroom/cask/pdftotext

Here is a simple Python function to do that:
In[1]:

import subprocess

def parse_pdf(filename):
    try:
        content = subprocess.check_output(["pdftotext", '-enc', 'UTF-8', filename, "-"])
    except subprocess.CalledProcessError as e:
        print('Skipping {} (pdftotext returned status {})'.format(filename, e.returncode))
        return None
    return content.decode('utf-8')

Let's try to parse a PDF file. We'll use requests to download a sample file.
In[2]:

import requests

url = 'http://www.cbu.edu.zm/downloads/pdf-sample.pdf'
response = requests.get(url)
with open('/tmp/pdf-sample.pdf', 'wb') as f:
    f.write(response.content)
Let's first look at the PDF:
In[3]:

from IPython.display import IFrame
IFrame('http://www.cbu.edu.zm/downloads/pdf-sample.pdf', width=600, height=870)
Out[3]:

Nothing complex. It should be easy to parse.
In[4]:

content = parse_pdf('/tmp/pdf-sample.pdf')
content
Out[4]:"Adobe Acrobat PDF Files\nAdobe Portable Document Format (PDF) is a universal file format that preserves all of the fonts, formatting, colours and graphics of any source document, regardless of the application and platform used to create it. Adobe PDF is an ideal format for electronic document distribution as it overcomes the problems commonly encountered with electronic file sharing. Anyone, anywhere can open a PDF file. All you need is the free Adobe Acrobat Reader. Recipients of other file formats sometimes can't open files because they don't have the applications used to create the documents. PDF files always print correctly on any printing device. PDF files always display exactly as created, regardless of fonts, software, and operating systems. Fonts, and graphics are not lost due to platform, software, and version incompatibilities. The free Acrobat Reader is easy to download and can be freely distributed by anyone. Compact PDF files are smaller than their source files and download a page at a time for fast display on the Web.\n\n \n\n \n\n\x0c"
This works quite well. The layout is not preserved, but it's the text that matters. It would be easy to write some regexes to define rules based on the PDF content.
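For example, here is a minimal sketch of what such a rule could look like. The vendor string, the date pattern and the filename scheme are all hypothetical, just to illustrate the idea:

import re

def guess_filename(content):
    # Hypothetical rule: electricity bills get renamed using the first date found in the text.
    if 'Electricity Board' in content:
        match = re.search(r'(\d{2})/(\d{2})/(\d{4})', content)
        if match:
            day, month, year = match.groups()
            return 'electricity-{}-{}-{}.pdf'.format(year, month, day)
    # No rule matched: leave the file for manual handling.
    return None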
This could be the first step in naming and organizing the scanned documents. But it would be nice to have an interface to easily search all the files. I had already used MongoDB full text search in a webapp I wrote and it worked well for my use case. But I had read about Elasticsearch and always wanted to give it a try.
Elasticsearch Ingest Attachment Processor Plugin

I could just index the result from pdftotext, but I know there is a plugin that can parse PDF files.
The Mapper Attachments Type plugin is deprecated in 5.0.0. It has been replaced with the ingest-attachment plugin. So let's look at that.
Running Elasticsearch

To run Elasticsearch, the easiest way is to use Docker. As the official image from Docker Hub comes with no plugins, we'll create our own image. See Elasticsearch Plugin Management with Docker for more information.
Here is our Dockerfile:
FROM elasticsearch:5
RUN /usr/share/elasticsearch/bin/elasticsearch-plugin install ingest-attachment

Create the elasticsearch-ingest docker image:
$ docker build -t elasticsearch-ingest .

We can now run Elasticsearch with the ingest-attachment plugin:
$ docker run -d -p 9200:9200 elasticsearch-ingest
Python Elasticsearch Client

We'll use elasticsearch-py to interact with our Elasticsearch cluster.
In[5]:

from elasticsearch import Elasticsearch

es = Elasticsearch()
Let's first check that our elasticsearch cluster is alive by asking about its health:
In[6]:

es.cat.health()
Out[6]:

'1479333419 21:56:59 elasticsearch green 1 1 0 0 0 0 0 0 - 100.0%\n'
Nice! We can start playing with our ES cluster.
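We can also double-check that the ingest-attachment plugin is actually loaded. Here is a quick sanity check of my own, assuming the cat plugins API exposed by elasticsearch-py:

# List installed plugins; the output should mention ingest-attachment.
print(es.cat.plugins())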
As described in the documentation, we first have to create a pipeline to use the Ingest Attachment Processor Plugin:
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}

OK, how do we do that using the Python client?
In[7]:

body = {
    "description" : "Extract attachment information",
    "processors" : [
        {
            "attachment" : {
                "field" : "data"
            }
        }
    ]
}
es.index(index='_ingest', doc_type='pipeline', id='attachment', body=body)

Out[7]:

{'acknowledged': True}
Now, we can send a document to our pipeline. Let's start by using the same example as in the documentation:
PUT my_index/my_type/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}

Using the Python client, this gives:
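A minimal sketch of that call, assuming elasticsearch-py forwards the pipeline query parameter on index() (the index, type and id names come straight from the documentation example above):

# Index a base64-encoded document through the attachment pipeline.
es.index(
    index='my_index',
    doc_type='my_type',
    id='my_id',
    pipeline='attachment',
    body={'data': 'e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0='}
)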