Extracting text from a PDF - All pages and Output - file using Python [duplicate]

Question

I'm new to Python. I am using this code to extract text from one page. Is it possible to extract all pages and write the output to a file?

import PyPDF2
pdf_file = open('sample.pdf','rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(10)
page_content = page.extractText()
print (page_content)

I think you can refer to this link stackoverflow.com/questions/17003185/… except pypdf2
– mikewolfli, Commented Apr 10, 2017 at 3:46

kindall · Accepted Answer · 2017-04-14 17:54:38Z

10

Use a loop to extract each page's text and write each page's text to a single file.

import PyPDF2
with open('sample.pdf','rb') as pdf_file, open('sample.txt', 'w') as text_file:
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    for page_number in range(number_of_pages):   # use xrange in Py2
        page = read_pdf.getPage(page_number)
        page_content = page.extractText()
        text_file.write(page_content)
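
The loop above uses PyPDF2's older camel-case names (`PdfFileReader`, `getPage`, `extractText`), which were deprecated and later removed in newer releases. Here is a minimal sketch of the same idea, assuming the successor package `pypdf` is installed and exposes `PdfReader`, `reader.pages` and `page.extract_text()`. The helper names `join_page_texts` and `pdf_to_text` are made up for illustration.

```python
def join_page_texts(page_texts):
    """Join per-page strings into one string, separated by newlines."""
    # A page without a text layer may yield an empty string; keep its slot.
    return "\n".join(text or "" for text in page_texts)


def pdf_to_text(pdf_path, txt_path):
    """Write the text of every page of pdf_path into txt_path."""
    # Imported lazily so join_page_texts works even without pypdf installed.
    from pypdf import PdfReader  # assumption: the pypdf package is available

    reader = PdfReader(pdf_path)
    page_texts = (page.extract_text() for page in reader.pages)
    with open(txt_path, "w", encoding="utf-8") as text_file:
        text_file.write(join_page_texts(page_texts))


# Example call, using the file names from the question:
# pdf_to_text("sample.pdf", "sample.txt")
```

Opening the output with an explicit `encoding="utf-8"` avoids platform-dependent encoding errors when the PDF contains non-ASCII characters.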

edited Apr 14, 2017 at 17:54

answered Apr 10, 2017 at 3:33


2 Comments

Raquel Dourado Over a year ago

perfect! it worked! But... is it possible to read this kind of pdf? cotemar.com.br/biblioteca/administracao/…

kindall Over a year ago

It appears that that PDF is a scanned book. Even if it has been OCR'd, I have no idea whether PyPDF can deal with it.
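
One rough way to check whether a PDF like that has an extractable text layer is to look for pages whose extracted text comes back blank. This is a minimal sketch that works on a list of already-extracted page strings. The helper name `pages_without_text` is hypothetical, and the sketch does not show what PyPDF actually returns for a scanned book.

```python
def pages_without_text(page_texts):
    """Return the 1-based numbers of pages whose extracted text is blank."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if not (text or "").strip()
    ]


# Illustrative input: the second and third pages have no usable text.
extracted = ["Chapter 1 ...", "", "   \n", "Chapter 2 ..."]
print(pages_without_text(extracted))  # prints [2, 3]
```

If every page shows up in the result, the file probably contains only page images and would need OCR before any text can be extracted.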

halfelf · Accepted Answer · 2018-12-24 05:53:06Z

I used the following code to convert multiple PDF files into text files:

import os
from os.path import isfile, join

import PyPDF2

pdf_dir = "D:/search/pdf"
txt_dir = "D:/pdf_to_text"

# Regular, non-hidden files in pdf_dir
corpus = (f for f in os.listdir(pdf_dir) if not f.startswith('.') and isfile(join(pdf_dir, f)))

for filename in corpus:
    with open(join(pdf_dir, filename), 'rb') as pdf:
        pdfReader = PyPDF2.PdfFileReader(pdf)

        # Start at 0 so the first page is not skipped
        for page in range(pdfReader.numPages):
            pageObj = pdfReader.getPage(page)
            text = pageObj.extractText()

            # One output file per page, e.g. file-page1.txt
            page_name = "{}-page{}.txt".format(filename[:4], page + 1)

            with open(join(txt_dir, page_name), mode="w", encoding='UTF-8') as o:
                o.write(text)

This code works, but each PDF has multiple pages, so running it produces one file per page: file1-page1.txt, file1-page2.txt, file1-page3.txt. I want a single file.txt that contains the text of all pages. How can I do that?

harsh · Accepted Answer · 2018-12-28 08:02:20Z

import os
from os.path import isfile, join

import PyPDF2

def getPptContent(path):
    # Return the text of every page of the PDF at `path` as one string
    text = ""
    with open(path, 'rb') as pdf:
        pdfReader = PyPDF2.PdfFileReader(pdf)
        for page in range(pdfReader.numPages):
            pageObj = pdfReader.getPage(page)
            text += pageObj.extractText()  # accumulate instead of overwriting
    return text

pdf_dir = "pdf_directory name"
corpus = [str(f) for f in os.listdir(pdf_dir) if not f.startswith('.') and
          isfile(join(pdf_dir, f))]

for filename in corpus:
    Path = pdf_dir + "/" + filename
    print(Path)
    file_content = getPptContent(Path)
    # The "output" subdirectory must already exist
    with open(pdf_dir + "/output/" + filename.split(".")[0] + ".txt", "w",
              encoding="utf-8") as f:
        f.write(file_content)

The code above works for me.
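
One detail worth noting: the output name above is built with `filename.split(".")[0]`, which cuts the name at the first dot, so two PDFs such as `report.v1.pdf` and `report.v2.pdf` would both be written to `report.txt`. Here is a small sketch of an alternative using `os.path.splitext`, which strips only the final extension. The helper name `output_name` and the sample file names are made up for illustration.

```python
import os


def output_name(pdf_filename):
    """Replace only the final extension of pdf_filename with .txt."""
    stem, _ext = os.path.splitext(pdf_filename)
    return stem + ".txt"


print(output_name("report.v2.pdf"))            # prints report.v2.txt
print("report.v2.pdf".split(".")[0] + ".txt")  # prints report.txt
```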
