How can I extract text from a PDF file in Python?

I tried the following:

import sys
import pyPdf

def convertPdf2String(path):
      content = ""
      pdf = pyPdf.PdfFileReader(file(path, "rb"))
      for i in range(0, pdf.getNumPages()):
          content += pdf.getPage(i).extractText() + " \n"
          content = " ".join(content.replace(u"\xa0", u" ").strip().split())
      return content

f = open('a.txt','w+')

f.write(convertPdf2String(sys.argv[1]).encode("ascii","xmlcharrefreplace"))
f.close()

But the result is as follows, rather than readable text:

728;ˇˆ˜ ˚ˇˇ!""˘ˇˆ˙ˆ˝˛˛˛˛ˆ˜ˆ ˆ ˆ˘ˆ˛˙ˆ"ˆ˘"ˆˆˆ˜#$˙ˆ˚ˆ %&ˆ ˘˛ˆ˜'˙˙%˝˛ˆˇ˙ ˜ˆˆ˜'ˆ ˇˆ#$%&('%$&))$$+%#,-.+&&˝())˝)˝+,,-./012)(˝)*˝+,-3˙ˆ/0245)6#57+82,55)6#57+,+2,+ /!#!!&˘˘1"%˘20˛˛3ˆ07%4!˘"6 ˛ˆ ˝ˆ ˆ˘&/&4"9ˆ %6ˇ%4%4&5˘2)˘˘˛%:6(

3 Comments
  • A PDF file does not necessarily store its text in a form that can be exported cleanly, because PDF creation tools can handle text in various ways. There is no guarantee that you can extract the text as a whole in the form you want. I assume your PDF is one of those files that look fine on screen but do not let you extract the content in a reasonable way. Commented Mar 23, 2013 at 5:17
  • I think this is a similar issue to the one I had here: link. If you need the information contained in such a PDF file, your best bet would be to dump it to TIFF (e.g. with Ghostscript) and run OCR on it (e.g. with Tesseract). Commented Mar 23, 2013 at 10:53
  • pypdf received a large number of updates in 2022. The results may be different if you upgrade your pypdf version. Commented Mar 1, 2023 at 17:36
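
Following up on the comment about newer pypdf releases, here is a minimal sketch of the same extraction with the current pypdf API. It assumes Python 3 and that pypdf has been installed (for example with pip install pypdf); the function names are illustrative and not from the original post. A PDF whose fonts lack a usable character mapping may still produce garbled text, in which case OCR remains the fallback.

```python
def normalize_whitespace(text):
    """Replace non-breaking spaces and collapse runs of whitespace."""
    return " ".join(text.replace("\xa0", " ").split())


def extract_pdf_text(path):
    """Return the text of every page in the PDF at `path` as one string."""
    # Imported here so the whitespace helper stays usable without pypdf.
    from pypdf import PdfReader  # third-party package, assumed installed

    reader = PdfReader(path)
    # Guard with `or ""` so a page with no extractable text adds nothing.
    pages = [page.extract_text() or "" for page in reader.pages]
    return normalize_whitespace("\n".join(pages))
```

Calling extract_pdf_text(sys.argv[1]) would then stand in for convertPdf2String in the question. Whitespace is normalized once at the end rather than on every loop iteration.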

1 Answer

If you are running Linux or macOS, you can call the ps2ascii command from your code:

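import os

# Names chosen to avoid shadowing the built-in input()
pdf_path = "someFile.pdf"
txt_path = "out.txt"
# ps2ascii (part of Ghostscript) writes the extracted text to txt_path
os.system("ps2ascii %s %s" % (pdf_path, txt_path))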

3 Comments

@anony try pdftotext instead of ps2ascii
What if I only need the text temporarily, just for further processing?
@Moj It prints 0 instead of the text in the file.
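
On the last two comments: os.system returns the command's exit status, not its output, so printing its result shows 0 when the command succeeds. If the text is only needed in memory, one option is to let the converter write to standard output and capture it. The sketch below assumes Python 3.7+ and that pdftotext is installed and on the PATH; the function name pdf_to_text is illustrative.

```python
import shutil
import subprocess


def pdf_to_text(pdf_path, tool="pdftotext"):
    """Run an external converter and return the extracted text as a string."""
    if shutil.which(tool) is None:
        raise FileNotFoundError(tool + " was not found on PATH")
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        [tool, pdf_path, "-"],
        capture_output=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces, and no temporary output file is left behind.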
