Working with PDF files in Python

You are probably familiar with PDFs. In fact, they are one of the most important and widely used digital document formats. PDF stands for Portable Document Format and uses the .pdf extension. It is used to present and exchange documents reliably, independent of software, hardware, or operating system.

Invented by Adobe, PDF is now an open standard maintained by the International Organization for Standardization (ISO). PDFs can contain links and buttons, form fields, audio, video, and business logic.

In this article, we will learn how to perform operations such as:

Extracting text from PDF
Rotating PDF pages
Merging PDFs
Splitting PDF
Adding watermark to PDF pages

using simple Python scripts!

Installation

We will be using a third-party module, PyPDF2.

PyPDF2 is a Python library built as a PDF toolkit. It is capable of:

Extracting document information (title, author, …)
Splitting documents page by page
Merging documents page by page
Cropping pages
Merging multiple pages into a single page
Encrypting and decrypting PDF files
and more!

To install PyPDF2, run the following command from the command line:

pip install PyPDF2

The module name is case sensitive when you import it, so make sure the y is lowercase and all the other letters are uppercase. All the code and PDF files used in this tutorial/article are available here.
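To confirm that the installation worked, a minimal sketch using only the standard library can report which version is present. The helper name installed_version is hypothetical. As far as I know, the class and method names used in this article (PdfFileReader, getPage, numPages, extractText) belong to PyPDF2 releases before 3.0, so knowing the installed version can save some confusion.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        # the package is not installed in the current environment
        return None


if __name__ == "__main__":
    version = installed_version("PyPDF2")
    if version is None:
        print("PyPDF2 is not installed; run: pip install PyPDF2")
    else:
        print("PyPDF2 version:", version)
```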

1. Extracting text from PDF file
# importing required modules
import PyPDF2
# creating a pdf file object
pdfFileObj = open('example.pdf', 'rb')
# creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# printing number of pages in pdf file
print(pdfReader.numPages)
# creating a page object
pageObj = pdfReader.getPage(0)
# extracting text from page
print(pageObj.extractText())
# closing the pdf file object
pdfFileObj.close()

The output of the above program looks like this:

20
PythonBasics
S.R.Doty
August27,2008
Contents
1Preliminaries
4
1.1WhatisPython?...................................
..4
1.2Installationanddocumentation....................
.........4 [and some more lines...]

Let us try to understand the above code in chunks:

pdfFileObj = open('example.pdf', 'rb')

We opened example.pdf in binary mode and saved the file object as pdfFileObj.
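As a small illustration of what binary mode gives us, the sketch below reads the first raw bytes of a file and compares them with the "%PDF-" signature that PDF files normally start with. The function name looks_like_pdf is hypothetical, and this is only a quick sanity check, not a full validation of the file.

```python
def looks_like_pdf(path):
    """Return True if the file at `path` starts with the PDF signature."""
    # 'rb' opens the file for reading raw bytes, with no text decoding
    with open(path, "rb") as pdf_file:
        header = pdf_file.read(5)
    # the with-block has already closed the file at this point
    return header == b"%PDF-"
```

The with statement closes the file automatically, so no explicit close() call is needed in this variant.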

pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

Here, we create an object of the PdfFileReader class of the PyPDF2 module, passing it the pdf file object, and get a pdf reader object.

print(pdfReader.numPages)

The numPages property gives the number of pages in the pdf file. In our case, it is 20 (see the first line of the output).

pageObj = pdfReader.getPage(0)

Now, we create an object of the PageObject class of the PyPDF2 module. The pdf reader object has a getPage() method, which takes a page number (starting from index 0) as its argument and returns the page object.
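Because getPage() counts from 0 while people usually count pages from 1, a small conversion helper can avoid off-by-one mistakes. This is a sketch only; page_index is a hypothetical name and is not part of PyPDF2.

```python
def page_index(page_number, num_pages):
    """Convert a 1-based page number into the 0-based index getPage() expects."""
    if not 1 <= page_number <= num_pages:
        raise ValueError(
            "page number must be between 1 and %d, got %d" % (num_pages, page_number)
        )
    return page_number - 1
```

With the objects from the code above, the first page would be fetched as pdfReader.getPage(page_index(1, pdfReader.numPages)).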

print(pageObj.extractText())

The page object has an extractText() method that extracts the text from the pdf page.

pdfFileObj.close()

Finally, we close the pdf file object.
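The steps above handle a single page. To collect the text of every page, the same calls can be wrapped in a loop. The sketch below assumes a reader object that exposes numPages, getPage() and extractText() exactly as used in this article; extract_all_text is a hypothetical helper name.

```python
def extract_all_text(pdf_reader):
    """Return a list holding the extracted text of each page, in page order."""
    texts = []
    for index in range(pdf_reader.numPages):
        # pages are numbered from 0, as with getPage(0) above
        page = pdf_reader.getPage(index)
        texts.append(page.extractText())
    return texts
```

With the article's objects this would be called as extract_all_text(pdfReader), and it should run before pdfFileObj.close(), since the reader still needs the open file to load the pages.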

Note: While PDF files are great for laying out text in a way that’s easy for people to print and read, they’re not straightforward for software to parse into plaintext. As such, PyPDF2 might make mistakes when extracting text from a PDF and may even be unable to open some PDFs at all. There isn’t much you can do about this, unfortunately. PyPDF2 may simply be unable to work with some of your particular PDF files.

2. Rotating PDF pages
# importing the required modules
import PyPDF2
def PDFrotate(origFileName, newFileName, rotation):
    # creating a pdf File object of original pdf
    pdfFileObj = open(origFileName, 'rb')
    # creating a pdf Reader object
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    # creating a pdf writer object for new pdf
    pdfWriter = PyPDF2.PdfFileWriter()
    # rotating each page
    for page in range(pdfReader.numPages):
        # creating rotated page object
        pageObj = pdfReader.getPage(page)
        pageObj.rotateClockwise(rotation)
        # adding rotated page object to pdf writer
        pdfWriter.addPage(pageObj)
    # new pdf file object
    newFile = open(newFileName, 'wb')
    # writing rotated pages to new file
    pdfWriter.write(newFile)
    # closing the original pdf file object
    pdfFileObj.close()
    # closing the new pdf file object
    newFile.close()
def main():
    # original pdf file name
    origFileName = 'example.pdf'
    # new pdf file name
    newFileName = 'rotated_example.pdf'
    # rotation angle
    rotation = 270
    # calling the PDFrotate function
    PDFrotate(origFileName, newFileName, rotation)
if __name__ == "__main__":
    # calling the main function
    main()

Here you can see what the first page of rotated_example.pdf looks like (right image) after rotation:

Some important points related to above code:

For rotation, we first create a pdf reader object of the original pdf.

pdfWriter = PyPDF2.PdfFileWriter()

Rotated pages will be written to a new pdf. For writing to pdfs, we use an object of the PdfFileWriter class of the PyPDF2 module.

for page in range(pdfReader.numPages):
    pageObj = pdfReader.getPage(page)
    pageObj.rotateClockwise(rotation)
    pdfWriter.addPage(pageObj)

Now, we iterate over each page of the original pdf. We get the page object with the getPage() method of the pdf reader class, rotate it with the rotateClockwise() method of the page object class, and then add the rotated page object to the pdf writer object using the addPage() method of the pdf writer class.
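Page rotation in a PDF works in steps of 90 degrees, and to my knowledge rotateClockwise() expects a multiple of 90. A minimal sketch of a check that could run before the loop is shown below; normalize_rotation is a hypothetical helper, not a PyPDF2 function.

```python
def normalize_rotation(angle):
    """Validate a clockwise rotation angle and reduce it to the range 0-359."""
    if angle % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")
    # e.g. 450 becomes 90, and -90 (counter-clockwise) becomes 270 clockwise
    return angle % 360
```

For the value used in this article, normalize_rotation(270) returns 270 unchanged, which is the same as turning the page 90 degrees counter-clockwise.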

newFile = open(newFileName, 'wb')
pdfWriter.write(newFile)
pdfFileObj.close()
newFile.close()

Now, we have to write the pdf pages to a new pdf file. First, we open the new file object and write the pdf pages to it using the write() method of the pdf writer object. Finally, we close the original pdf file object and the new file object.
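The open/write/close sequence can also be written with a with statement, so that the new file is closed even if write() raises an error. The sketch below assumes reader and writer objects with the same methods used in this article (numPages, getPage(), rotateClockwise(), addPage(), write()); write_rotated is a hypothetical helper name.

```python
def write_rotated(pdf_reader, pdf_writer, rotation, out_path):
    """Rotate every page of pdf_reader and write the result to out_path."""
    for index in range(pdf_reader.numPages):
        page = pdf_reader.getPage(index)
        page.rotateClockwise(rotation)
        pdf_writer.addPage(page)
    # the with-block closes the output file, even if write() fails
    with open(out_path, "wb") as new_file:
        pdf_writer.write(new_file)


# Hypothetical usage with the article's file names (legacy PyPDF2 API assumed):
# with open('example.pdf', 'rb') as pdfFileObj:
#     write_rotated(PyPDF2.PdfFileReader(pdfFileObj),
#                   PyPDF2.PdfFileWriter(), 270, 'rotated_example.pdf')
```

The original file is kept open in the commented usage until writing finishes, which matches the order of the close() calls in the article's version.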

3. Merging PDF files
# importing required modules
import PyPDF2
def PDFmerge(pdfs, output):
    # creating pdf file merger object
    pdfMerger = PyPDF2.PdfFileMerger()
    # appending pdfs one by one
    for pdf in pdfs:
        with open(pdf, 'rb') as f:
            pdfMerger.append(f)
    # writing combined pdf to output pdf file
    with open(output, 'wb') as f:
        pdfMerger.write(f)
def main():
    # pdf files to merge
    pdfs = ['example.pdf', 'rotated_example.pdf']
    # output pdf file name
    output = 'combined_example.pdf'
    # calling pdf merge function
    PDFmerge(pdfs = pdfs, output = output)
if __name__ == "__main__":
    # calling the main function
    main()

Output of above program is a combined pdf, combined_exa
