How to get text from local PDF file using Python

Question

Please do not use "tika" for an answer. I have already tried answers from this question:

I have this PDF file, https://drive.google.com/file/d/1aUfQAlvq5hA9kz2c9CyJADiY3KpY3-Vn/view?usp=sharing , and I would like to copy the text.

import PyPDF2
pdfFileObject = open('C:\\Path\\To\\Local\\File\\Test_PDF.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
count = pdfReader.numPages
for i in range(count):
    page = pdfReader.getPage(i)
    print(page.extractText())

The output is "Date Submitted: 2019-10-21 16:03:36.093 | Form Key: 5544", which is only the first line of the text. The next line of text in the PDF starts with "Exhibit A to RFA....", but it is not included in the output.

Can you explain what you mean by "which is only part of the text"? The reader reads line by line, so it is giving the right sequential output.
– AzyCrw4282, Commented Jul 30, 2020 at 19:44
@AzyCrw4282 I am trying to get all the text in the PDF, not just the first line.
– Jortega, Commented Jul 30, 2020 at 19:56

AzyCrw4282 · Accepted Answer · 2020-07-30 20:12:11Z

I have never used PyPDF2 myself, so I can't say exactly what is going wrong. However, the documentation says the following about the function extractText():

Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.

Here's an alternative way to get around this, which also explains what may be going wrong. I would also recommend using pdftotext, which has worked reliably for me many times; this answer will also prove helpful with that.
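
For reference, a minimal sketch of the pdftotext route. It assumes the tool meant here is the pdftotext Python package (a Poppler binding, installed with pip install pdftotext); the function name extract_with_pdftotext is hypothetical and only for illustration.

```python
import os


def extract_with_pdftotext(path):
    """Return the text of every page in the PDF at `path`, joined by blank lines."""
    # Fail early with a clear error if the path is wrong.
    if not os.path.isfile(path):
        raise FileNotFoundError(f"No such PDF file: {path}")

    # Imported lazily so the path check above runs even when the
    # third-party package (an assumption of this sketch) is not installed.
    import pdftotext

    with open(path, "rb") as f:
        pdf = pdftotext.PDF(f)  # behaves like a sequence of page strings
        return "\n\n".join(pdf)
```

Calling print(extract_with_pdftotext('C:\\Path\\To\\Test_PDF.pdf')) should then print the text of all pages, provided Poppler can read the file.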


Thanks, this set me on the right track. Posting my answer shortly. pip install pdfminer.six was the key.
– Jortega
Great, happy to help.
– AzyCrw4282

Jortega · Accepted Answer · 2020-07-31 01:00:28Z

Found a solution.

# pip install pdfminer.six
import io

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    '''Extract the text of a PDF file and return it as a string.

    :param path: path to the PDF file
    '''
    rsrcmgr = PDFResourceManager()
    codec = 'utf-8'
    laparams = LAParams()

    with io.StringIO() as retstr:
        with TextConverter(rsrcmgr, retstr, codec=codec,
                           laparams=laparams) as device:
            with open(path, 'rb') as fp:
                interpreter = PDFPageInterpreter(rsrcmgr, device)
                password = ""
                maxpages = 0
                caching = True
                pagenos = set()

                for page in PDFPage.get_pages(fp,
                                              pagenos,
                                              maxpages=maxpages,
                                              password=password,
                                              caching=caching,
                                              check_extractable=True):
                    interpreter.process_page(page)

                return retstr.getvalue()


if __name__ == "__main__":
    print(convert_pdf_to_txt('C:\\Path\\To\\Test_PDF.pdf'))
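
As a shorter variant of the same idea: pdfminer.six also provides a high-level helper, pdfminer.high_level.extract_text, that wraps the resource manager, converter and interpreter setup shown above. The sketch below assumes an installed pdfminer.six release that includes that helper; the wrapper name pdf_to_text is hypothetical.

```python
import os


def pdf_to_text(path):
    """Return all text in the PDF at `path` using pdfminer.six's high-level API."""
    # Fail early with a clear error if the path is wrong.
    if not os.path.isfile(path):
        raise FileNotFoundError(f"No such PDF file: {path}")

    # Imported lazily so the path check above runs even when
    # pdfminer.six is not installed (pip install pdfminer.six).
    from pdfminer.high_level import extract_text

    # extract_text accepts a file path and returns the text as one string.
    return extract_text(path)
```

Usage would mirror the function above, for example print(pdf_to_text('C:\\Path\\To\\Test_PDF.pdf')). The explicit version is still useful when the layout parameters or the page selection need to be customized.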
