Python Web Scripting

Question

I have wanted to do this for some websites before but didn't know where to start. This time, however, I am adamant. I am talking about scripts that crawl a website and extract the data we require. My target is this: I have to appear for job interviews in December. There is a site (http://www.geeksforgeeks.org/) which contains a large number of questions from previous interviews (like http://www.geeksforgeeks.org/amazon-interview-set-42-on-campus/ and http://www.geeksforgeeks.org/adobe-interview-set-6-campus-mts-1/). Every title has the word "set" and a number in it. It is quite cumbersome to keep track of which ones I have done and which I have not, so I want to extract the questions from each of these pages and put them in a PDF with the title. How can I do this using curl, regex and Scrapy? I am intermediate in C/C++/Java but have only beginner proficiency in Python. Any help is much appreciated. Also, point me to any such scripts you know of. I want to do this on my own; I just need a starting point and some guidance. Thanks.

Marcin · Accepted Answer · 2013-09-11 18:00:19Z

3

If you just want a starting point, try Scrapy, a screen-scraping library for Python. I would recommend that you use the requests library for making requests. It's by far the simplest option (with no loss of power).

Also, don't try to parse HTML or XML with a regex. Just don't. Use one of the fine libraries available (BeautifulSoup or lxml, or lxml with a BeautifulSoup backend, are the most popular, but there are others).
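As a rough illustration of that advice, here is a minimal sketch that downloads a page with requests and parses it with BeautifulSoup instead of a regex. The function names are made up for this example, and the choice of `<h1>` for the title and `<p>` for the question text is only an assumption; inspect the real pages and adjust the tags accordingly.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


def extract_title_and_paragraphs(html):
    """Return (title, paragraphs) parsed from an HTML string.

    Assumption: the title is in the first <h1> and the questions are in
    <p> tags. The real page layout may differ.
    """
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h1")
    title = heading.get_text(strip=True) if heading else ""
    paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    # Drop paragraphs that contain only whitespace.
    return title, [text for text in paragraphs if text]
```

Something like `extract_title_and_paragraphs(fetch_html(url))` with one of the URLs from the question would then give you the title and text to write out. Keeping the download and the parsing in separate functions lets you test the parsing on a saved copy of a page without hitting the site each time.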

edited Sep 11, 2013 at 18:00

answered Sep 11, 2013 at 17:54


2 Comments

rishiag Over a year ago

Thanks. Also do I need to manually collect all the links I want my crawler to crawl?

Marcin Over a year ago

@user1425223 Unless...there's a source for them. I'm not sure what else you would expect. You could probably automate that collection process.
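One possible way to automate the collection, sketched under assumptions: take the HTML of an index or listing page and keep every link whose URL contains "set-" followed by a number, as in the two example URLs from the question. The pattern and the names `LinkCollector` and `collect_set_links` are hypothetical. The sketch uses only the standard library's `html.parser`, so it runs without extra installs; BeautifulSoup would work just as well. The regex is applied to the extracted `href` string, not to the HTML itself.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed pattern: "set-" followed by digits somewhere in the URL,
# e.g. ".../amazon-interview-set-42-on-campus/".
SET_LINK = re.compile(r"set-\d+", re.IGNORECASE)


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags that match SET_LINK."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and SET_LINK.search(href):
            absolute = urljoin(self.base_url, href)  # resolve relative links
            if absolute not in self.links:  # skip duplicates, keep page order
                self.links.append(absolute)


def collect_set_links(html, base_url):
    """Return the matching links found in an HTML string, in page order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links
```

Feeding it the HTML of a listing page gives a list of URLs the crawler can then visit one by one.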
