I would like to download ordinary web pages, as well as web-hosted PPT and PDF files, in Python. To minimize the amount of data I need to download, I would like to fetch just the text and skip any images.
This sounds feasible for ordinary websites, but I'm not sure whether it's possible for PPT and PDF files. How can I accomplish this?
I'm planning to use the textract module to extract the content of these files after downloading them, but I'd be interested to know whether there are alternatives that would make my problem easier to solve.
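
For the plain HTML case, here is a rough sketch of the kind of thing I have in mind, using only the standard library. It assumes that requesting a page's HTML does not pull in its images, since a client fetches those in separate requests. The names `TextOnlyParser`, `html_to_text` and `fetch_html` are my own placeholders, not part of any library, and this does not address PPT or PDF files at all.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextOnlyParser(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # <img> tags carry no text data, so they are ignored implicitly.
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


def html_to_text(html):
    """Return the text of an HTML string, one text node per line."""
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def fetch_html(url, timeout=10):
    """Download only the HTML document at `url` (hypothetical helper).

    Images referenced by the page are never requested here.
    Falls back to UTF-8 when the server does not declare a charset.
    """
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Usage would be something like `html_to_text(fetch_html(url))` for whichever URL I want to process.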