I would like to download normal webpages, web-hosted PPTs and PDFs in Python. However, to minimize the amount of data that I need to download, I would like to download just the text and ignore any images.

This sounds feasible with normal websites, but I'm not sure if it's possible for PPTs and PDFs. How can I accomplish this?

I'm planning to use the textract module to extract the content of these pages after downloading them, but I'd be interested to know if there are alternatives that would make my problem easier to solve.

1 Answer

Take a look at the textract library. It handles pretty much all the formats you mention, i.e., HTML, PDF and PPT.
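One thing worth keeping in mind: fetching a normal webpage only transfers the HTML markup, and images are separate requests that are never made unless you ask for them. A PDF or PowerPoint file, on the other hand, is a single binary with any images embedded, so as far as I know the whole file has to be downloaded before the text can be pulled out. Here is a rough sketch of that workflow, not a definitive implementation. The names `html_to_text`, `url_suffix` and `fetch_text` are made up for this example, deciding the file type from the URL extension is a simplifying assumption, and I have left legacy `.ppt` out because I am not sure textract supports it.

```python
import os
import tempfile
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urlparse


class TextOnlyParser(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text of an HTML string; <img> tags are simply ignored."""
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def url_suffix(url):
    """Return the lowercase file extension of the URL path, e.g. '.pdf'."""
    return os.path.splitext(urlparse(url).path)[1].lower()


def fetch_text(url):
    """Download one URL and return its text (hypothetical helper)."""
    with urllib.request.urlopen(url) as resp:
        body = resp.read()
        charset = resp.headers.get_content_charset() or "utf-8"

    suffix = url_suffix(url)
    if suffix in {".pdf", ".pptx"}:
        # Third-party dependency, imported lazily so the HTML path
        # works without it. The whole binary has already been downloaded.
        import textract

        with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
            tmp.write(body)
        try:
            # textract.process returns bytes
            return textract.process(tmp.name).decode("utf-8", errors="replace")
        finally:
            os.remove(tmp.name)

    # Plain webpage: only the markup was transferred, no images.
    return html_to_text(body.decode(charset, errors="replace"))
```

Calling `fetch_text("https://example.com/page.html")` (a placeholder URL) would go through the HTML branch, while a URL ending in `.pdf` or `.pptx` would be written to a temporary file and handed to textract.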
