
Parsing PDFs in Python with Tika


A few months ago, one of my friends asked me if I could help him extract some data from a collection of PDFs. The PDFs contained records of his financial transactions over a period of years and he wanted to analyze them. Unfortunately, Excel and plain text versions of the files were no longer available, so the PDFs were his only option.

I reviewed a few Python-based PDF parsers and decided to try Tika, which is a port of Apache Tika. Tika parsed the PDFs quickly and accurately. I extracted the data my friend needed and sent it to him in CSV format so he could analyze it with the program of his choice. Tika was so fast and easy to use that I enjoyed the experience enough to write a blog post about parsing PDFs with Tika.


California Budget PDFs

To demonstrate parsing PDFs with Tika, I knew I’d need some PDFs. I was thinking about which ones to use and remembered a blog post I’d read on scraping budget data from a government website. Governments also provide data in PDF format, so I decided it would be helpful to demonstrate how to parse data from PDFs available on a government website. This way, with these two blog posts, you have examples of acquiring government data, whether it’s embedded in HTML or in PDFs. The three PDFs we’ll parse in this post are:

2015-16 State of California Enacted Budget Summary Charts

2014-15 State of California Enacted Budget Summary Charts

2013-14 State of California Enacted Budget Summary Charts

Each of these PDFs contains several tables that summarize total revenues and expenditures, general fund revenues and expenditures, expenditures by agency, and revenue sources. For this post, let’s extract the data on expenditures by agency and revenue sources. In the 2015-16 Budget PDF, the titles for these two tables are:

2015-16 Total State Expenditures by Agency
2015-16 Revenue Sources

To follow along with the rest of this tutorial you’ll need to download the three PDFs and ensure you’ve installed Tika. You can download the three PDFs here:

http://www.ebudget.ca.gov/2015-16/pdf/Enacted/BudgetSummary/SummaryCharts.pdf

http://www.ebudget.ca.gov/2014-15/pdf/Enacted/BudgetSummary/SummaryCharts.pdf

http://www.ebudget.ca.gov/2013-14/pdf/Enacted/BudgetSummary/SummaryCharts.pdf

You can install Tika by running the following command in a Terminal window:

pip install --user tika

IPython

Before we dive into parsing all of the PDFs, let’s use one of the PDFs, 2015-16CABudgetSummaryCharts.pdf, to become familiar with Tika and its output. We can use IPython to explore Tika’s output interactively:

ipython

from tika import parser

parsedPDF = parser.from_file("2015-16CABudgetSummaryCharts.pdf")

You can type the name of the variable, a period, and then hit tab to view a list of all of the methods available to you:

parsedPDF.



There are many options related to keys and values, so it appears the variable contains a dictionary. Let’s view the dictionary’s keys:

parsedPDF.viewkeys()

parsedPDF.keys()

The dictionary’s keys are metadata and content. Let’s take a look at the values associated with these keys:

parsedPDF["metadata"]

The value associated with the key “metadata” is another dictionary. As you’d expect based on the name of the key, its key-value pairs provide metadata about the parsed PDF.
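If you would rather check this structure in a script than by tab completion, a small helper can report the type of each top-level value. This is only a sketch: `describe_parsed` is a hypothetical helper name, and `stand_in` is a made-up dictionary that merely imitates the shape described above so the example runs without Tika installed.

```python
def describe_parsed(parsed):
    """Map each top-level key to the type name of its value."""
    return {key: type(value).__name__ for key, value in parsed.items()}


# Hypothetical stand-in for the dictionary returned by parser.from_file();
# the values here are placeholders, not real Tika output.
stand_in = {
    "metadata": {"Content-Type": "application/pdf"},
    "content": "placeholder text",
}

summary = describe_parsed(stand_in)
for key in sorted(summary):
    print(key, "->", summary[key])
```

With a real parse you would pass `parsedPDF` in place of `stand_in`.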



Now let’s take a look at the value associated with “content”.

parsedPDF["content"]

The value associated with the key “content” is a string. As you’d expect, the string contains the PDF’s text content.
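Because the content is one long string, a common first step is to break it into non-empty, stripped lines before looking for the tables. The sketch below assumes the text may contain runs of blank lines and that the content could be missing (`None`) for a file with no extractable text. `content_lines` is a hypothetical helper, and `sample_content` is placeholder text rather than output from the budget PDFs.

```python
def content_lines(content):
    """Return the non-empty lines of a content string, stripped of whitespace."""
    if not content:
        # Assumption: treat missing or empty content as "no lines"
        return []
    return [line.strip() for line in content.split("\n") if line.strip()]


# Placeholder text standing in for parsedPDF["content"]
sample_content = (
    "\n\n  Placeholder Title \n\nFirst placeholder line\n   \nSecond placeholder line\n"
)

lines = content_lines(sample_content)
for line in lines:
    print(line)
```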



Now that we know the types of objects and values Tika provides to us, let’s write a Python script to parse all three of the PDFs. The script will iterate over the PDF files in a folder and, for each one, parse the text from the file, select the lines of text associated with the expenditures by agency and revenue sources tables, convert each of these selected lines of text into a Pandas DataFrame, display the DataFrame, and create and save a horizontal bar plot of the totals column for the expenditures and revenues. So, after you run this script, you’ll have six new plots, one for revenues and one for expenditures for each of the three PDF files, in the folder in which you ran the script.
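The selection step described above can be sketched with the standard library alone: one regular expression captures the block of lines that belongs to a table, and a second splits each line into a label and its numeric values. Everything here is an illustrative assumption rather than the script from this post: `extract_rows` is a hypothetical helper, both patterns are invented for the example, and `SAMPLE_TEXT` is made-up placeholder data, not figures from the California budget.

```python
import re


def extract_rows(text, content_pattern, line_pattern):
    """Return [label, value, value, ...] rows from the block matched by content_pattern."""
    match = re.search(content_pattern, text, re.DOTALL)
    if match is None:
        return []
    rows = []
    # group(1) holds only the lines captured between the table's start and end markers
    for line in match.group(1).split("\n"):
        line_match = re.match(line_pattern, line.strip())
        if line_match is None:
            continue  # skip blank lines and lines without trailing numbers
        label = line_match.group(1).strip()
        # Drop dollar signs and thousands separators before converting to int
        values = [
            int(v.replace("$", "").replace(",", ""))
            for v in line_match.group(2).split()
        ]
        rows.append([label] + values)
    return rows


# Made-up placeholder text; the real PDFs will have a different layout
SAMPLE_TEXT = """Placeholder Table Title
Agency A $1,000 $200
Agency B 3,500 40
Total $4,500 $240
Next Section
"""

rows = extract_rows(
    SAMPLE_TEXT,
    content_pattern=r"Placeholder Table Title\n(.*?)\nTotal",
    line_pattern=r"^(.*?)\s+(\$?[\d,]+(?:\s+\$?[\d,]+)*)$",
)
for row in rows:
    print(row)
```

A list of rows like this can then be handed to `pd.DataFrame(rows, columns=...)` along with a list of column headings.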

Python Script

To parse the three PDFs, create a new Python script named parse_pdfs_with_tika.py and add the following lines of code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import csv
import glob
import os
import re
import sys
import pandas as pd
import matplotlib
matplotlib.use('AGG')
import matplotlib.pyplot as plt
pd.options.display.mpl_style = 'default'

from tika import parser

input_path = sys.argv[1]

def create_df(pdf_content, content_pattern, line_pattern, column_headings):
    """Create a Pandas DataFrame from lines of text in a PDF.

    Arguments:
    pdf_content -- all of the text Tika parses from the PDF
    content_pattern -- a pattern that identifies the set of lines
    that will become rows in the DataFrame
    line_pattern -- a pattern that separates the agency name or revenue
    source from the dollar values in the line
    column_headings -- the list of column headings for the DataFrame
    """
    list_of_line_items = []
    # Grab all of the lines of text that match the pattern in content_pattern
    content_match = re.search(content_pattern, pdf_content, re.DOTALL)
    # group(1): only keep the lines between the parentheses in the pattern
    content_match = content_match.group(1)
    # Split on newlines to create a sequence of strings
    content_match = content_match.split('\n')
    # Iterate over each line
    for item in content_match:
        # Create a list to hold the values in the line we want to retain
        line_items = []
        # Use line_pattern to separate the agency name or revenue source
        # from the dollar values in the line
