I have been trying to print the output to a new text file. But I get the error

TypeError: expected a character buffer object

What I'm trying to do is convert a PDF to text and write the extracted text to a new file.

import pyPdf

def getPDFContent():
  content = ""
  # Load PDF into pyPDF
  pdf = pyPdf.PdfFileReader(file("D:\output.pdf", "rb"))
  # Iterate pages
  for i in range(0, pdf.getNumPages()):
    # Extract text from page and add to content
    #content += pdf.getPage(i).extractText() + "\n"
    print pdf.getPage(i).extractText().encode("ascii", "ignore")

  # Collapse whitespace
  #content = " ".join(content.replace(u"\xa0", " ").strip().split())
  #return content

  #getPDFContent().encode("ascii", "ignore")
  getPDFContent()

  s =getPDFContent()
  with open('D:\pdftxt.txt', 'w') as pdftxt:
      pdftxt.write(s)

I did try to initialize s as a str, but then I get the error "can't assign to function call".

  • Your getPDFContent() function doesn't return anything. print is not the same thing as return. Commented Jun 7, 2014 at 19:02
  • @Martijn plus I don't think there's meant to be a couple of recursive calls in there... So I'm guessing the indentation is not exactly reliable either Commented Jun 7, 2014 at 19:04
  • Your code sample is a bit of a mess. Can you clean it up (fix the indentation, remove obsolete comments, etc.). Include the actual attempt; I suspect the print version posted here is not your only version you tried. Commented Jun 7, 2014 at 19:04
  • I had even tried return before, but the only thing I got was page 1; the rest of the pages never appeared in my text file. print was the only one that worked, in that the interpreter displayed the complete output, but it didn't copy it to a new text file. Commented Jun 7, 2014 at 19:18
  • possible duplicate of TypeError: expected a character buffer object - while trying to save integer to textfile Commented May 6, 2015 at 20:42

1 Answer

You are not returning anything from getPDFContent(), so s is None and you are effectively passing None to write(), which is what raises the TypeError.
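
To make the difference between print and return concrete, here is a minimal sketch. It is written in Python 3 syntax rather than the Python 2 of the question, and `print_only`, `collect`, and the `pages` list are hypothetical stand-ins for the PDF extraction, not part of pyPdf.

```python
import os
import tempfile


def print_only(pages):
    # Prints each page but has no return statement, so the call evaluates to None.
    for page in pages:
        print(page)


def collect(pages):
    # Builds the full text and hands it back to the caller.
    return "\n".join(pages)


pages = ["page one", "page two"]  # stand-in for text extracted from a PDF

printed = print_only(pages)
collected = collect(pages)

path = os.path.join(tempfile.mkdtemp(), "pdftxt.txt")
with open(path, "w") as out:
    try:
        out.write(printed)  # printed is None, so this raises TypeError
        write_failed = False
    except TypeError:
        write_failed = True
    out.write(collected)  # a real string is written without complaint
```

Only the value a function returns is available to the caller, so the extracted text has to be returned rather than printed before it can be written to a file.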

result = []
for i in range(0, pdf.getNumPages()):
    result.append(pdf.getPage(i).extractText().encode("ascii", "ignore"))  # store each page's text in a list
return result

s = getPDFContent()
with open(r'D:\pdftxt.txt', 'w') as pdftxt:
    pdftxt.writelines(s)  # use writelines to write the list contents

How your code should look:

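import pyPdf

def getPDFContent():
    result = []
    # Open the PDF in binary mode; the raw string keeps the backslash in the path literal
    with open(r"D:\output.pdf", "rb") as f:
        pdf = pyPdf.PdfFileReader(f)
        # Iterate over the pages and collect each page's text
        for i in range(pdf.getNumPages()):
            result.append(pdf.getPage(i).extractText().encode("ascii", "ignore"))
    # Return after the loop so that every page is included
    return result

s = getPDFContent()
with open(r'D:\pdftxt.txt', 'w') as pdftxt:
    pdftxt.writelines(s)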
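
One detail worth checking in the output: writelines() does not add line separators, so the text of consecutive pages runs together unless each item ends with a newline. The sketch below illustrates this in Python 3 syntax, with an in-memory io.StringIO standing in for the output file and a hypothetical `pages` list standing in for the extracted text.

```python
import io

pages = ["first page", "second page"]  # stand-in for per-page extracted text

# writelines() writes each string as-is, with no separator in between.
run_together = io.StringIO()
run_together.writelines(pages)

# Appending "\n" to each item keeps every page on its own line.
separated = io.StringIO()
separated.writelines(page + "\n" for page in pages)
```

If the pages should stay separated in the text file, appending "\n" to each entry before it goes into the list would have the same effect.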

1 Comment

The comments in the function suggest more was tried. But as it stands currently it is a mess.
