I am trying to convert the bytes I get from book_download_page = requests.get(link), followed by content = book_download_page.content, into a string.

What I have tried:

content = book_download_page.content.decode('utf-8')

The error I get:

'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte

Edit: You can try this link for downloading.

Thank you!

  • Does this answer your question? How to extract text from a PDF file? Commented Jun 25, 2020 at 3:39
  • Try other encodings such as 'latin-1'. Please also share the link so I can check it and suggest a solution. Commented Jun 25, 2020 at 3:40

1 Answer

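PDF contents are made up of tokens rather than plain UTF-8 text, so decoding the raw bytes directly will not give you the document's text. See here: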

Adobe PDF Reference
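
To see why the decode fails, here is a minimal sketch. The sample bytes are an assumption, not the actual file from the question: they are a PDF header followed by the binary comment line that many PDF writers emit. The helper names looks_like_pdf and try_utf8 are hypothetical.

```python
# Illustrative bytes only: a PDF header plus a binary comment line.
# The real file's bytes are not shown in the question.
sample = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n1 0 obj\n"


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the PDF header marker appears near the start of the data."""
    return b"%PDF-" in data[:1024]


def try_utf8(data: bytes):
    """Return the decoded text, or the UnicodeDecodeError raised by the attempt."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc


result = try_utf8(sample)
print(looks_like_pdf(sample))  # the data carries the PDF marker
print(result)                  # the decode error, if any

# latin-1 maps every byte to a character, so this never raises, but it
# yields raw PDF syntax rather than the text of the document.
raw = sample.decode("latin-1")
```

With these sample bytes, the UTF-8 decode fails on byte 0xe2 at position 10, the same byte and position as in the error above. Decoding as 'latin-1' avoids the exception but still does not extract the text.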

You can parse PDFs and extract text with tools such as PoDoFo in C++ or PDFBox in Java, and there is also a PDF text stripper for Python.

import pdfbox

pdf_ref = pdfbox.PDFBox()
pdf_ref.extract_text('directory/originalPDF.pdf')   # Result .txt will be in directory/originalPDF.txt

This is a simple example paraphrased from python-pdfbox; see that project if you also want to convert other things, such as images.
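
Because extract_text takes a file path, the downloaded bytes have to be written to disk first. A minimal sketch, assuming content holds book_download_page.content from the question; save_pdf_bytes is a hypothetical helper, not part of requests or python-pdfbox.

```python
from pathlib import Path


def save_pdf_bytes(content: bytes, path: str) -> Path:
    """Write downloaded bytes to disk unchanged, after a basic PDF header check."""
    # A server may return an HTML error page instead of the PDF, so check
    # for the header marker before saving.
    if b"%PDF-" not in content[:1024]:
        raise ValueError("response body does not look like a PDF")
    target = Path(path)
    target.write_bytes(content)  # binary write: no decoding involved
    return target
```

In the question's setting this would be called as save_pdf_bytes(book_download_page.content, 'directory/originalPDF.pdf'), and the saved path can then be passed to extract_text.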
