I have a Python script that converts PDF content to a string.

text = list();

#npages is number of pages in the PDF file.
for n in range(npages):
    text[n] = os.system('pdftotext myfile.pdf -') #the "-" prints to stdout.

print(text)

However, when I print text, this is the output (for a PDF file with two pages):

{0: 0, 1: 0}

When running the script, I see the os.system output being sent to the command line:

text from myfile.pdf page 1
text from myfile.pdf page 2

How can I store the standard output from the pdftotext command in a list?

  • But you create a dictionary at line 1? Is it list or dictionary? Commented May 21, 2019 at 10:12
  • @Wimanicesir Ah sorry - fixed! Commented May 21, 2019 at 10:13
  • ① if text were a list, you would receive an IndexError when you try to access the non-existing element text[0] ② at every iteration you are receiving the whole text of the PDF file, not just the text of an individual page. Very sloppy question. Commented May 21, 2019 at 10:26

1 Answer


You are not capturing the command's output, only its exit status, which is what os.system returns. Generally 0 means success, so the command succeeded on both iterations (n = 0 and n = 1).
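To make the difference between the exit status and the captured output concrete, here is a minimal sketch. It assumes Python 3.7+ and uses a child Python process as a stand-in for pdftotext, so the command itself is only illustrative:

```python
import subprocess
import sys

# Stand-in for an external command: a child Python process that prints one line.
cmd = [sys.executable, "-c", "print('text from page 1')"]

# capture_output=True collects stdout/stderr; text=True decodes them to str.
result = subprocess.run(cmd, capture_output=True, text=True)

status = result.returncode  # the exit status, the kind of value os.system gives you
output = result.stdout      # what the command actually wrote to stdout

print(status)
print(repr(output))
```

The status is just a number describing how the process ended; the text you want lives in result.stdout.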

You may use subprocess and redirect output to your Python script. A shorthand for this is:

import subprocess

out = subprocess.check_output(['ls', '-lh']) # example
print(out)

To pass the bare - argument, you'll need to use subprocess.Popen with bufsize=0. This should work:

cmd = ['pdftotext', 'myfile.pdf', '-']
proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, bufsize=0)
# out holds stdout as bytes; err is None here because stderr is not piped
out, err = proc.communicate()

print(out)
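To get one list entry per page, as the question asks, one possible approach is to call pdftotext once per page with its -f (first page) and -l (last page) options. This is only a sketch: page_command and pdf_to_page_list are hypothetical helper names, it assumes pdftotext is on your PATH, and it assumes the output is UTF-8.

```python
import subprocess


def page_command(pdf_path, page):
    """Build the pdftotext argument list for a single 1-based page number."""
    # -f and -l select the first and last page to convert; "-" writes to stdout.
    return ["pdftotext", "-f", str(page), "-l", str(page), pdf_path, "-"]


def pdf_to_page_list(pdf_path, npages):
    """Return a list with one decoded string per page (hypothetical helper)."""
    pages = []
    for page in range(1, npages + 1):
        # check_output returns bytes and raises CalledProcessError on failure.
        raw = subprocess.check_output(page_command(pdf_path, page))
        # Assumption: the output is UTF-8; undecodable bytes are replaced.
        pages.append(raw.decode("utf-8", errors="replace"))
    return pages
```

With the file from the question you would call pages = pdf_to_page_list('myfile.pdf', npages). Using append also avoids the IndexError that text[n] = ... raises on an empty list.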

6 Comments

I am a little confused about how and where to use subprocess for the pdftotext call. If I do: text[n] = subprocess.check_output('pdftotext myfile.pdf -') I get the following error: FileNotFoundError: [Errno 2] No such file or directory: 'pdftotext myfile.pdf -'
@oliverbj check doc first, before you ask! It should be subprocess.check_output(['pdftotext', 'myfile.pdf', '-']) which only accepts commands in a list.
@oliverbj In the answer the argument to check_output is a list of strings; in your comment it is a single string. Maybe this difference is the reason for your errors?
@oliverbj Looks like it's caused by the last -. You'll need subprocess.Popen for that. I'll update my answer.
@knh190 - I just tried doing subprocess.check_output(['pdftotext', 'myfile.pdf', '-']) - but that returns with an error code of 99. I also tried removing the - part.
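
The FileNotFoundError discussed in the comments above most likely comes from passing the whole command line as one string: on POSIX, without shell=True, subprocess treats that string as the name of a single program. If you already have the command as a string, a small sketch using the standard shlex module shows how to split it into the list form:

```python
import shlex

command_line = "pdftotext myfile.pdf -"

# shlex.split follows shell-like quoting rules, so quoted file names
# containing spaces stay together as one argument.
args = shlex.split(command_line)

print(args)
```

The resulting list can be passed directly to subprocess.check_output or subprocess.Popen.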
