I need to call the pdfminer top-level Python script from my Python code.

Here is the link to pdfminer documentation:

https://github.com/pdfminer/pdfminer.six

The README shows how to call it from a terminal prompt as follows:

pdf2txt.py samples/simple1.pdf

Here, pdf2txt.py is installed globally by the pip command:

pip install pdfminer.six

I would like to call this from my python code, which is in the project root directory:

my_main.py (in the project root directory)

for pdf_file_name in input_file_list:
   # somehow call pdf2txt.py with pdf_file_name as argument
   # and write out the text file in the output_txt directory

How can I do that?

  • Can you import the script like a module? Add a main function to the script if it doesn't already have one; then, when you import the file, you can just call main. Commented Dec 21, 2018 at 1:13

2 Answers

I think you need to import it in your code and follow the examples in the docs:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
# Open a PDF file.
fp = open('mypdf.pdf', 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
# Supply the password for initialization (an empty string if the PDF has none).
password = ''
document = PDFDocument(parser, password)
# Check if the document allows text extraction. If not, abort.
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Create a PDF device object.
device = PDFDevice(rsrcmgr)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
for page in PDFPage.create_pages(document):
    interpreter.process_page(page)

I don't see the point of going through the shell when the library can be imported and used directly.
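
If your installed pdfminer.six is recent enough to ship the `pdfminer.high_level` module, its `extract_text` function may be a shorter route for the loop in the question. Below is a minimal sketch, not a definitive implementation: `convert_all` and its `extract` parameter are names I made up for illustration, and it assumes one `.txt` file per PDF in the `output_txt` directory. Older releases may not have `extract_text`, in which case the lower-level classes above are the way to go.

```python
from pathlib import Path


def convert_all(pdf_paths, out_dir="output_txt", extract=None):
    """Write one .txt file per PDF into out_dir and return the paths written.

    `extract` is a hypothetical hook: any callable that takes a PDF path
    and returns its text. It defaults to pdfminer's extract_text.
    """
    if extract is None:
        # Assumption: the installed pdfminer.six provides pdfminer.high_level.
        from pdfminer.high_level import extract_text as extract

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for pdf_path in pdf_paths:
        # samples/simple1.pdf -> output_txt/simple1.txt
        target = out_dir / (Path(pdf_path).stem + ".txt")
        target.write_text(extract(str(pdf_path)), encoding="utf-8")
        written.append(target)
    return written
```

In `my_main.py` this would be called as `convert_all(input_file_list)`. Note that two inputs with the same file name in different directories would overwrite each other's output.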

I would suggest two ways to do this!

  1. Use os

    import os
    os.system("pdf2txt.py samples/simple1.pdf")
    
  2. Use subprocess

    import subprocess
    subprocess.call("pdf2txt.py samples/simple1.pdf", shell=True)
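
Of the two, `subprocess` is generally the better choice: the Python documentation for `os.system` itself recommends the `subprocess` module. Passing the command as a list also avoids shell quoting problems with file names that contain spaces, and `check=True` turns a failed conversion into an exception. Here is a rough sketch of the loop from the question. It assumes `pdf2txt.py` is on your PATH and directly executable (on Windows you may still need `shell=True` or an explicit interpreter), and it uses the script's `-o` option to choose the output file. `build_command` and `convert_with_cli` are hypothetical helper names.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, out_dir="output_txt"):
    """Return the pdf2txt.py argument list that writes <out_dir>/<name>.txt."""
    target = Path(out_dir) / (Path(pdf_path).stem + ".txt")
    return ["pdf2txt.py", "-o", str(target), str(pdf_path)]


def convert_with_cli(pdf_paths, out_dir="output_txt"):
    """Run pdf2txt.py once per PDF, stopping at the first failure."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for pdf_path in pdf_paths:
        # check=True raises CalledProcessError if pdf2txt.py exits non-zero.
        subprocess.run(build_command(pdf_path, out_dir), check=True)
```

From `my_main.py` you would call `convert_with_cli(input_file_list)`.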
    

1 Comment

Thank you. This worked. Which one of these is better? Using os.system or subprocess.call? And why?
