How to convert bytes from PDF to string in Python?

Question

I am trying to convert the bytes I get from book_download_page = requests.get(link), followed by content = book_download_page.content, into a string.

What I have tried,

content = book_download_page.content.decode('utf-8')

Error I get,

'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte

Edit: You can try this link for downloading.

Thank you!

Does this answer your question? How to extract text from a PDF file?
– metatoaster, Commented Jun 25, 2020 at 3:39
Try other decodings like 'latin-1'. Also, please share the link so I can check it and give you a solution.
– NAGA RAJ S, Commented Jun 25, 2020 at 3:40

user176692 · Accepted Answer · 2020-06-25 03:46:35Z


PDF contents are made up of tokens, not plain text, so the raw bytes cannot simply be decoded as UTF-8; see here:

You can parse PDFs and extract text with tools like PoDoFo in C++ or PDFBox in Java; there is also a PDF text stripper in Python.

import pdfbox

pdf_ref = pdfbox.PDFBox()
pdf_ref.extract_text('directory/originalPDF.pdf')   # writes the extracted text to directory/originalPDF.txt

This simple example is paraphrased from python-pdfbox, in case you also want to convert other things, such as images.
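
To see why the decode in the question fails, and how the downloaded bytes could be saved for an extractor instead, here is a minimal sketch. It assumes the file starts the way many PDFs do: a %PDF-1.x header followed by a comment line of high bytes that marks the file as binary. The sample bytes and the names SAMPLE_PDF_START, looks_like_pdf, and save_pdf_bytes are made up for illustration; in the question's code the data would be book_download_page.content.

```python
import tempfile
from pathlib import Path

# Assumed sample: a version header, then a comment line holding four bytes
# >= 0x80, which many PDF writers emit so tools treat the file as binary.
SAMPLE_PDF_START = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload begins with the PDF magic number."""
    return data.startswith(b"%PDF-")


def save_pdf_bytes(data: bytes, path: Path) -> Path:
    """Write the bytes to disk unchanged (binary mode, no decoding)."""
    if not looks_like_pdf(data):
        raise ValueError("payload does not start with %PDF-")
    path.write_bytes(data)
    return path


# Decoding the raw bytes as UTF-8 fails on the first binary byte.
try:
    SAMPLE_PDF_START.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

# Save the bytes as a file instead; a PDF library can then read that file.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_pdf_bytes(SAMPLE_PDF_START, Path(tmp) / "book.pdf")
    round_trip = saved.read_bytes()
```

With this sample, the byte 0xe2 sits at position 10, the same position the error in the question reports. Once the real download is saved this way, the file path can be handed to an extractor such as pdf_ref.extract_text from the example above.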

