How can I extract text from textboxes within a PDF in Python?

Question

I'm not having any luck with PyPDF2 or PDFMiner. Both tools always return _______________ for the textboxes, even when they are filled in. Does anyone know how to extract the text entered in the textbox fields?

stackoverflow.com/questions/15583535/…, stackoverflow.com/questions/34129936/…, stackoverflow.com/questions/26494211/…
– Jesse, commented May 25, 2018 at 1:15

A.Andruhovski · Accepted Answer · 2018-05-25 08:10:09Z

You need to read the form fields, not the page text. Something like this should work:

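from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1

# Open in binary mode; the context manager closes the file when done.
with open("c:\\tmp\\test.pdf", "rb") as fp:
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    # Form fields live under the catalog's AcroForm dictionary, which is
    # absent when the PDF contains no interactive form.
    if "AcroForm" not in doc.catalog:
        raise SystemExit("This PDF has no AcroForm dictionary (no form fields).")
    fields = resolve1(resolve1(doc.catalog["AcroForm"]).get("Fields", []))
    for i in fields:
        field = resolve1(i)
        # "T" is the field name and "V" its current value (often bytes).
        name, value = field.get("T"), field.get("V")
        print("{0}:{1}".format(name, value))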
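The loop above only looks at top-level fields and prints raw values. As a rough sketch (not part of the original answer), the same idea can be extended to descend into nested fields under "Kids" and to decode byte strings. The helper names decode_pdf_string, walk_fields and read_form_fields are made up for illustration, the latin-1 fallback is only an approximation of PDFDocEncoding, and value inheritance from parent fields is ignored. It assumes pdfminer.six is installed; the import is deferred so the two pure-Python helpers can be tried without it.

```python
def decode_pdf_string(value):
    """Best-effort conversion of a raw field name/value to str (or None)."""
    if value is None:
        return None
    if isinstance(value, bytes):
        # PDF text strings starting with a UTF-16BE byte order mark are UTF-16.
        if value.startswith(b"\xfe\xff"):
            return value[2:].decode("utf-16-be", errors="replace")
        # Assumption: latin-1 as a stand-in for PDFDocEncoding.
        return value.decode("latin-1")
    # Non-string values (e.g. name objects for checkboxes) are stringified.
    return str(value)


def walk_fields(fields, resolve=lambda obj: obj, prefix=""):
    """Yield (qualified_name, value) pairs, descending into child fields."""
    for ref in fields:
        field = resolve(ref)
        name = decode_pdf_string(field.get("T"))
        full = ".".join(part for part in (prefix, name) if part)
        kids = [resolve(k) for k in resolve(field.get("Kids") or [])]
        # Kids without a "T" entry are treated as widget annotations,
        # not as child fields, so the value stays on the parent.
        children = [k for k in kids if "T" in k]
        if children:
            yield from walk_fields(children, resolve, full)
        else:
            yield full, decode_pdf_string(resolve(field.get("V")))


def read_form_fields(path):
    """Return {field_name: value} for a PDF, or {} if it has no AcroForm."""
    # Deferred import: only needed when an actual PDF is read.
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdftypes import resolve1

    with open(path, "rb") as fp:
        doc = PDFDocument(PDFParser(fp))
        if "AcroForm" not in doc.catalog:
            return {}
        acro_form = resolve1(doc.catalog["AcroForm"])
        top_level = resolve1(acro_form.get("Fields", []))
        return dict(walk_fields(top_level, resolve1))
```

Calling read_form_fields("c:\\tmp\\test.pdf") would then give a plain dictionary of field names and values. An empty result means the file has no AcroForm dictionary at all, in which case the visible text is probably not stored as form fields.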


1 Comment

Farhang Amaji Over a year ago

It didn't work for me; it raised KeyError: 'AcroForm'.
