Extracting text from PDF in Python [closed]

Question

Closed. This question needs details or clarity. It is not currently accepting answers.

Closed 3 months ago.

I have a PDF full of quotes:

https://www.pdf-archive.com/2017/03/22/test/

I can extract the text in Python using the following code:

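import PyPDF2

# Legacy PyPDF2 API. In PyPDF2 3.0 and later these calls are named
# PdfReader, reader.pages[0] and page.extract_text().
pdfFileObj = open('example.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
print(pageObj.extractText())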

This returns all the quotes as one paragraph. Is it possible to split the PDF at the horizontal separators and get the individual quotes that way?

Can you please provide an example of the text and an example of how you want it to look?
– Stanley Kirdey, Commented Mar 22, 2017 at 21:21
The link goes to the PDF. In this PDF there are two phrases. I am looking to extract the two phrases/quotes into two string variables, which I will then process further.
– user7692855, Commented Mar 22, 2017 at 21:23

bhansa · Accepted Answer · 2017-03-22 21:35:21Z

1

If you just want to extract the quotes from the PDF text, you can use a regex to find all of them.

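import PyPDF2
import re

# Legacy PyPDF2 API. In PyPDF2 3.0 and later these calls are named
# PdfReader, reader.pages[0] and page.extract_text().
pdfFileObj = open('test.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
text = str(pageObj.extractText())

# Each match is one double-quoted span, quote marks included
quotes = re.findall(r'"[^"]*"', text)
for quote in quotes:
    print(quote)
    print()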

or just

quotes = re.findall(r'"[^"]*"', text)
print(quotes)
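
Since the pattern is terse, here is a minimal sketch of how it reads, run on a made-up string rather than on the real PDF text (the `sample` value is an assumption for illustration only). Note that it matches straight ASCII double quotes only. If the PDF uses curly quotes, the pattern would need adjusting.

```python
import re

# Hypothetical text standing in for what extractText() returns.
sample = 'intro "First quote."\n"Second quote,\nspanning two lines." outro'

# "      an opening double quote
# [^"]*  zero or more characters that are not a double quote (newlines included)
# "      the closing double quote
quotes = re.findall(r'"[^"]*"', sample)

for quote in quotes:
    print(quote)
    print()
```

Because `[^"]*` also matches newlines, a quote that wraps over several lines still comes back as a single string.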

answered Mar 22, 2017 at 21:35


3 Comments

user7692855 Over a year ago

Thank you very much for that. It seems to be working. Could you explain the regex? I can't quite make sense of it. My main goal is extracting pdf-archive.com/2017/03/22/pdf; maybe you could have a quick look at that. I thought I might be able to learn from this PDF, but I'm not sure if I can.

bhansa Over a year ago

What do you actually want to extract from the PDF? There are no quotes in it, and it is totally different from your question.

user7692855 Over a year ago

The end goal is JSON pairs of reference, date, applicant, location and proposal. I made up the other PDF to learn on and then apply that to the main PDF, but I don't think the code is transferable.

Liam Giannini · Accepted Answer · 2017-03-22 21:26:38Z

0

I could not find a way to split it by the horizontal separator, but I managed to do it another way:

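import PyPDF2

# Legacy PyPDF2 API. In PyPDF2 3.0 and later these calls are named
# PdfReader, reader.pages[0] and page.extract_text().
pdfFileObj = open('test.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)

# Split wherever a closing quote ends a line, then print the pieces spaced apart
for x in pageObj.extractText().split('"\n'):
    print(x + "\n" * 5)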
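
One thing to be aware of with this approach: `str.split` removes the separator, so every piece loses its closing quote, and a trailing empty piece can be left over. A minimal sketch of collecting the pieces into a list instead of printing them, using a made-up `sample` string (an assumption, not the real PDF text):

```python
# Hypothetical text standing in for what extractText() returns.
sample = '"First quote."\n"Second quote."\n'

# split() drops the '"\n' separator, so put the closing quote back
# and skip the empty piece left after the last separator.
quotes = [piece + '"' for piece in sample.split('"\n') if piece]

print(quotes)
```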

answered Mar 22, 2017 at 21:26


JIJOMON K.A · Accepted Answer · 2019-07-10 09:21:44Z

0

import pdfplumber

file_path = 'test.pdf'  # placeholder: path to your PDF
pdf = pdfplumber.open(file_path)
p0 = pdf.pages[0]  # first page
text = p0.extract_text()
print(text)
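
For some context: pdfplumber also exposes the drawn objects on a page, so it may be possible to do what the question originally asked and split the page at the horizontal separators. The following is only a sketch under assumptions. It assumes the separators are stored as line objects (some PDFs draw them as thin rectangles, which show up in `page.rects` instead), and the helper names `split_at_separators` and `extract_sections` are made up for this example. It has not been checked against the PDF from the question.

```python
def split_at_separators(separator_tops, page_top, page_bottom):
    """Turn the y-positions of horizontal rules into (top, bottom) bands."""
    # Ignore rules lying on or outside the page edges.
    inside = sorted(t for t in separator_tops if page_top < t < page_bottom)
    edges = [page_top] + inside + [page_bottom]
    # Pair consecutive edges and skip empty bands (e.g. duplicate rules).
    return [(top, bottom) for top, bottom in zip(edges, edges[1:]) if bottom > top]


def extract_sections(file_path):
    """Return the text found between horizontal rules on the first page."""
    import pdfplumber  # third-party; imported here so the helper above runs without it

    with pdfplumber.open(file_path) as pdf:
        page = pdf.pages[0]
        x0, page_top, x1, page_bottom = page.bbox
        # Assumption: a separator is a line whose two ends share (almost) the same height.
        tops = [ln["top"] for ln in page.lines if abs(ln["top"] - ln["bottom"]) < 1]
        sections = []
        for top, bottom in split_at_separators(tops, page_top, page_bottom):
            text = page.crop((x0, top, x1, bottom)).extract_text()
            if text and text.strip():
                sections.append(text.strip())
        return sections
```

Calling `extract_sections(file_path)` would then give one string per region between separators, provided the assumption about line objects holds for the file.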

edited Jul 10, 2019 at 9:21

JIJOMON K.A

answered Jul 10, 2019 at 8:43

jainam shah

1 Comment

ZF007 Over a year ago

Add context around your answer for future learning purpose and downvote prevention. (From Review).
