I need to parse a PDF file to extract just the first few lines of text. I have looked at several Python packages for the job, but without any luck.

I have tried:

  • PDFminer, PDFminer.six and PDFminer3k, which appear overly complex for such a simple job; I was unable to find a simple working example

  • slate, which failed to install at first but worked with the fix from a thread; it then raised an error when I tried to use it. Maybe I am using the wrong PDFminer, but I can't figure out which one to use

  • PyPDF2 and PyPDF3, but these gave garbage as described here

  • tika, which gave various error messages in the terminal and was very slow

  • pdftotext, which failed to install

  • pdf2text, which failed at "import pdf2text"; when changed to "pdftotext", it failed to import with "ImportError: cannot import name 'Extractor'", even though pip list shows that "Extractor" is installed

Usually I find that installed Python packages work amazingly well, but parsing PDF to text appears to be a jungle, as the myriad of tools also indicates.

Any suggestions for how to do simple parsing of a PDF file to text in Python?

PyPDF2 example added

An example of PyPDF2 is:

import PyPDF2

# Open the file in a with-block so the handle is closed after reading
with open('file.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    pageObj_0 = pdfReader.getPage(0)  # first page
    print(pageObj_0.extractText())

This returns garbage such as:

$%$%&%&$'(' ˜!)"*+#

    Please, don't close this question... I simply seek some Python code that works... if SO is not for that, then what is it for? Commented Jan 24, 2020 at 13:28

1 Answer

Based on pdfminer, I was able to extract the bare necessity from the pdf2txt.py script (provided with pdfminer) into a function:

import io

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

def pdf_to_text(path):
    """Return the text of every page in the PDF at `path` as one string."""
    with open(path, 'rb') as fp:
        rsrcmgr = PDFResourceManager()   # shared store for fonts and other resources
        outfp = io.StringIO()            # in-memory buffer that collects the text
        laparams = LAParams()            # default layout-analysis parameters
        device = TextConverter(rsrcmgr, outfp, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Feed each page through the interpreter; the device writes to outfp
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
        device.close()
    text = outfp.getvalue()
    return text
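
Since the question only asks for the first few lines, here is a minimal sketch of trimming the extracted text down to them. The names first_lines and pdf_first_lines are hypothetical helpers, not part of pdfminer. The second function assumes a reasonably recent pdfminer.six that provides pdfminer.high_level.extract_text with a maxpages argument; if your version lacks it, pass the result of pdf_to_text(path) to first_lines instead.

```python
from itertools import islice


def first_lines(text, n=5):
    """Return the first n non-blank lines of text, stripped of whitespace."""
    stripped = (line.strip() for line in text.splitlines())
    return list(islice((line for line in stripped if line), n))


def pdf_first_lines(path, n=5):
    """Return the first n non-blank lines from the first page of a PDF.

    Assumption: pdfminer.six is installed and exposes
    pdfminer.high_level.extract_text(..., maxpages=...).
    The import is local so first_lines works without pdfminer.
    """
    from pdfminer.high_level import extract_text

    # maxpages=1 limits extraction to the first page only
    return first_lines(extract_text(path, maxpages=1), n)


if __name__ == "__main__":
    sample = "\n\nTitle of the document\n   \nAuthor Name\nAbstract text\n"
    print(first_lines(sample, 2))
```

Reading only the first page avoids parsing the whole document when just the opening lines are needed, though the order of the lines still depends on how pdfminer reconstructs the layout.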

3 Comments

I have found this solution to work fairly well, but if you compare the output to the original PDF file you'll find discrepancies. Sentences are cut and reordered, paragraphs have missing sentences, etc.
@Kamil I wonder if these issues are artifacts of the specific PDF file itself or a result of how it was created in the first place.
@oldboy, that's a good question. I've never had much luck parsing PDFs. I actually found that converting the PDF to a simple Word or txt file and then applying a parsing algorithm works out better. This is specific to parsing words only, because once you go after anything besides text (e.g. images) the process breaks down.
