Only loading the text content of a URL in Python

Question

I would like to download normal webpages, web-hosted PPTs, and PDFs in Python. However, to minimize the amount of data I need to download, I would like to download just the text and ignore any images.

This sounds feasible for normal websites, but I'm not sure whether it's possible for PPTs and PDFs. How can I accomplish this?

I'm planning to use the textract module to extract the content of these pages after downloading them, but I'd be interested to know if there are alternatives that would make my problem easier to solve.

zsquare · Accepted Answer · 2016-02-25 08:49:41Z

1

Take a look at the textract library. It covers pretty much all of your requirements, i.e., HTML, PDF, and PPT.
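For the plain-webpage case, here is a minimal sketch using only the standard library, in case it helps. Fetching an HTML page over HTTP transfers just the markup; images referenced by `<img>` tags are separate resources that are only downloaded if you request them, so nothing extra is needed to skip them. The names `TextOnlyParser`, `html_to_text`, and `fetch_text` are my own illustrative choices, not part of textract or any other library. As far as I know, the same trick does not apply to PDF and PPT files, because their images are embedded in the file itself, so you would still download the whole file and then hand it to textract.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TextOnlyParser(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the text fragments of an HTML string, one per line."""
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def fetch_text(url, timeout=10):
    """Download a single HTML document and return its text.

    Only the markup is transferred; linked images are never requested.
    """
    # The User-Agent value is an arbitrary placeholder (assumption).
    request = Request(url, headers={"User-Agent": "text-only-fetcher"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return html_to_text(html)
```

Calling `fetch_text("https://example.com")` should return the page's text without touching any of its images. This parser is deliberately simple; for messy real-world pages a dedicated HTML library may give cleaner output.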

