0

Please do not use "tika" for an answer. I have already tried answers from this question:

How to extract text from a PDF file?

I have this PDF file, https://drive.google.com/file/d/1aUfQAlvq5hA9kz2c9CyJADiY3KpY3-Vn/view?usp=sharing , and I would like to copy the text.

import PyPDF2
pdfFileObject = open('C:\\Path\\To\\Local\\File\\Test_PDF.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
count = pdfReader.numPages
for i in range(count):
    page = pdfReader.getPage(i)
    print(page.extractText())

The output is "Date Submitted: 2019-10-21 16:03:36.093 | Form Key: 5544" which is only part of the text. The next line of text starts with "Exhibit A to RFA...."

2
  • Can you explain what you mean by which is only part of the text? The reader reads line by line and hence it's giving the right sequential output. Commented Jul 30, 2020 at 19:44
  • @ AzyCrw4282 I am trying to get all the text in the PDF not just the first line. Commented Jul 30, 2020 at 19:56

2 Answers 2

1

I have never used PYPDF2 myself so can't really input my knowledge to find out exactly what's going wrong. But the following from the documentation states the following about the function extractText()

Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.

Here's an alternative way to get around this and also exaplains what maybe going wrong. I would also recommend using pdftotext. This has worked reliably for me many times; this answer will also prove helpful in that.

Sign up to request clarification or add additional context in comments.

2 Comments

Thanks this set me on the right track. Posting my answer shortly. pip install pdfminer.six was the key.
great, happy to help
1

Found a solution.

#pip install pdfminer.six
import io

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    '''Convert pdf content from a file path to text

    :path the file path
    '''
    rsrcmgr = PDFResourceManager()
    codec = 'utf-8'
    laparams = LAParams()

    with io.StringIO() as retstr:
        with TextConverter(rsrcmgr, retstr, codec=codec,
                           laparams=laparams) as device:
            with open(path, 'rb') as fp:
                interpreter = PDFPageInterpreter(rsrcmgr, device)
                password = ""
                maxpages = 0
                caching = True
                pagenos = set()

                for page in PDFPage.get_pages(fp,
                                              pagenos,
                                              maxpages=maxpages,
                                              password=password,
                                              caching=caching,
                                              check_extractable=True):
                    interpreter.process_page(page)

                return retstr.getvalue()


if __name__ == "__main__":
    print(convert_pdf_to_txt('C:\\Path\\To\\Test_PDF.pdf'))

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.