I have a PDF full of quotes:

https://www.pdf-archive.com/2017/03/22/test/

I can extract the text in Python using the following code:

import PyPDF2

pdfFileObj = open('example.pdf','rb') 
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)         
print (pageObj.extractText())

This returns all the quotes as one paragraph. Is it possible to split the PDF at the horizontal separators, so that each quote is extracted separately?

2
  • Can you please provide an example of the text and an example of how you want the output to look? Commented Mar 22, 2017 at 21:21
  • The link will go to the PDF. In this PDF there are two phrases. I am looking to extract the two phrases/quotes into two string variables which I will then process further. Commented Mar 22, 2017 at 21:23

3 Answers

1

If you just want to extract the quotes from the PDF text, you can use a regular expression to find all of them.

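import re

import PyPDF2

# Read the first page of the PDF and extract its text.
with open('test.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    pageObj = pdfReader.getPage(0)
    text = str(pageObj.extractText())

# Match every span that starts and ends with a double quote.
quotes = re.findall(r'"[^"]*"', text)
for quote in quotes:
    print(quote)
    print()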

or just

quotes = re.findall(r'"[^"]*"', text)
print(quotes)
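
A rough breakdown of the pattern, shown on a made-up string (the sample text is hypothetical and only stands in for whatever extractText() returns):

```python
import re

# Hypothetical sample standing in for the text extracted from the PDF.
text = 'intro "First quote." filler "Second\nquote." end'

# "      a literal opening double quote
# [^"]*  zero or more characters that are NOT a double quote (newlines included)
# "      the literal closing double quote
pattern = r'"[^"]*"'

quotes = re.findall(pattern, text)
for quote in quotes:
    print(quote)
```

Note that this only matches straight double quotes. If the PDF uses typographic quotes (“ ”), those characters would have to go into the pattern instead.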

3 Comments

Thank you very much for that. It seems to be working. Could you explain the regex? I can't quite make sense of it. My main goal is extracting data from pdf-archive.com/2017/03/22/pdf; maybe you could have a quick look at that. I thought I might be able to learn from this PDF, but I'm not sure if I can.
What do you actually want to extract from that PDF? There are no quotes in it, and it is totally different from your question.
The end goal is JSON pairs of reference, date, applicant, location and proposal. I made the other PDF to learn the approach and then apply it to the main PDF, but I don't think the code is transferable.
0

I could not find a way to split it by the horizontal separator, but I managed to do it another way:

import PyPDF2

# Each quote ends with a closing double quote followed by a newline.
with open('test.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    pageObj = pdfReader.getPage(0)
    quotes = pageObj.extractText().split('"\n')

for x in quotes:
    print(x + "\n" * 5)
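
One thing to watch with this approach: splitting on '"\n' consumes the closing quote of each piece. A small sketch on a made-up string (the sample text is hypothetical, not taken from the PDF) that puts the closing quote back and drops empty pieces:

```python
# Hypothetical sample standing in for the text extracted from the PDF.
text = '"First quote."\n"Second quote."\n'

# Split where a closing double quote is followed by a newline,
# then discard pieces that contain only whitespace.
parts = [p for p in text.split('"\n') if p.strip()]

# The separator swallowed the closing quote, so restore it where missing.
quotes = [p if p.endswith('"') else p + '"' for p in parts]

for quote in quotes:
    print(quote)
```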

Comments

0
import pdfplumber

file_path = 'test.pdf'  # path to the PDF

# Extract the text of the first page.
with pdfplumber.open(file_path) as pdf:
    p0 = pdf.pages[0]
    text = p0.extract_text()

print(text)
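
pdfplumber also exposes the vector lines drawn on a page, which may allow splitting at the horizontal separators the question asks about. A rough sketch, assuming the separators are stored as line objects (they could just as well be thin rectangles or part of an image, in which case this finds nothing). The helper names bands_from_separators and extract_sections are made up for illustration:

```python
def bands_from_separators(separator_tops, page_top, page_bottom):
    """Turn the vertical positions of horizontal rules into (top, bottom) bands."""
    edges = [float(page_top)] + sorted(separator_tops) + [float(page_bottom)]
    # Skip bands that are too thin to contain any text.
    return [(a, b) for a, b in zip(edges, edges[1:]) if b - a > 1.0]


def extract_sections(path, page_number=0):
    """Return the text found between consecutive horizontal lines on one page."""
    import pdfplumber  # imported here so the helper above works without it

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_number]
        x0, page_top, x1, page_bottom = page.bbox
        # Assumption: each separator is a (nearly) horizontal line object.
        tops = [
            ln["top"] for ln in page.lines
            if abs(ln["bottom"] - ln["top"]) < 1.0
            and page_top < ln["top"] < page_bottom
        ]
        sections = []
        for top, bottom in bands_from_separators(tops, page_top, page_bottom):
            # Crop the page to the band and extract only that region's text.
            band_text = page.crop((x0, top, x1, bottom)).extract_text() or ""
            if band_text.strip():
                sections.append(band_text.strip())
        return sections
```

Calling extract_sections('test.pdf') would then return one string per region between separators, provided the assumption about line objects holds for the file.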

1 Comment

Add context around your answer for future learning purpose and downvote prevention. (From Review).
