How to convert PDF files encoded in Unicode into text using Python 3 and PyPDF2

Question

I am trying to convert PDFs into text files using Python 3 and the PyPDF2 library. The PDFs are mainly written in Korean, so it seems the text has to be encoded in 'utf-8' before the PDF text is processed. However, neither opening the PDF files with the built-in "open" function nor with the "codecs" version extracts properly 'utf-8'-encoded text. Do you have any ideas for extracting text from PDF files using Python 3 and any other relevant Python libraries? Thanks in advance!

(You can download an example file via http://dart.fss.or.kr/pdf/download/pdf.do?rcp_no=20180402005019&dcm_no=6060273)

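import PyPDF2
import codecs

# First attempt: open the PDF as a plain binary file.
pdf_file = open('6060273.pdf', 'rb')
# Second attempt (also tried): open it through codecs with an explicit encoding.
#pdf_file = codecs.open('6060273.pdf', 'rb', encoding='utf-8')

read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(20)  # page indices are zero-based
page_content = page.extractText()
print(page_content.encode('utf-8'))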

iSerd · Accepted Answer · 2018-12-17 15:53:47Z

1

It seems to me that your problem is related to the font sources installed on your machine rather than to how the file is opened. The basic package that comes with PyPDF does not include the whole UTF-8 universe in advance, because bundling every option would increase the size of the library. However, you can install the necessary fonts in the directory.
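As a side note on the output step in the question's code: in Python 3, extractText() returns a str, and calling .encode('utf-8') on it produces a bytes object, so print() shows an escaped bytes literal rather than readable characters. The sketch below uses only the standard library and a hard-coded sample string as a stand-in for the extracted page text (the sample string, the save_text helper, and the output file name are all hypothetical). It only shows one way to write such a string to a UTF-8 text file; it does not address whether PyPDF2 can decode the fonts in this particular PDF.

```python
import os
import tempfile


def save_text(text: str, path: str) -> None:
    """Write an already-decoded str to a UTF-8 encoded text file."""
    with open(path, "w", encoding="utf-8") as out:
        out.write(text)


# Hypothetical stand-in for the str returned by page.extractText().
page_content = "한국어 텍스트"

# encode() turns the str into bytes; printing bytes shows an escaped
# literal (b'\x..'), not the Korean characters themselves.
as_bytes = page_content.encode("utf-8")
print(as_bytes)

# Writing the str with an explicit encoding keeps the characters intact.
out_path = os.path.join(tempfile.mkdtemp(), "page.txt")
save_text(page_content, out_path)

# Read the file back to check that nothing was lost on the way.
with open(out_path, encoding="utf-8") as f:
    round_trip = f.read()
```

If the string coming out of the extraction step is already garbled, saving it this way will not repair it; in that case the problem lies in the extraction itself.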

