Crawler with Python?

Question

I'd like to write a crawler using python. This means: I've got the url of some websites' home page, and I'd like my program to crawl through all the website following links that stay into the website. How can I do this easily and FAST? I tried BeautifulSoup already, but it is really cpu consuming and quite slow on my pc.

Often discussed...look at SO question dealing with "scrapy"

user2665694
– user2665694

2011-07-11 10:55:16 +00:00
Commented Jul 11, 2011 at 10:55 — user2665694
– user2665694, Commented Jul 11, 2011 at 10:55

ilprincipe · Accepted Answer · 2011-07-11 14:23:31Z

6

I'd recommend using mechanize in combination with lxml.html. as robert king suggested, mechanize is probably best for navigating through the site. for extracting elements I'd use lxml. lxml is much faster than BeautifulSoup and probably the fastest parser available for python. this link shows a performance test of different html parsers for python. Personally I'd refrain from using the scrapy wrapper.

I haven't tested it, but this is probably what youre looking for, first part is taken straight from the mechanize documentation. the lxml documentation is also quite helpful. especially take a look at this and this section.

import mechanize
import lxml.html

br = mechanize.Browser()
response = br.open("somewebsite")

for link in br.links():
    print link
    br.follow_link(link)  # takes EITHER Link instance OR keyword args
    print br
    br.back()

# you can also display the links with lxml
html = response.read()
root = lxml.html.fromstring(html)
for link in root.iterlinks():
    print link

you can also get elements via root.xpath(). A simple wget might even be the easiest solution.

Hope I could be helpful.

edited Jul 11, 2011 at 14:23

answered Jul 11, 2011 at 12:11

ilprincipe

8666 silver badges24 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

cerberos Over a year ago

I don't think the first for loop is doing what you intend.

ilprincipe Over a year ago

i have only used mechanize for navigating through forms so far. isnt it simply printing each link on the site and then following it? br.back might be unnecessary. but it seems to me like its doing exactly that. posts here are similar. But since i am pretty new to python/programming in general, please correct me if I am wrong.

Matteo Monti Over a year ago

Anyway, I started using lxml and everything got faster! Thank you!

Community · Accepted Answer · 2017-05-23 12:02:54Z

3

I like using mechanize. Its fairly simple, you download it and create a browser object. With this object you can open a URL. You can use "back" and "forward" functions as in a normal browser. You can iterate through the forms on the page and fill them out if need be. You can iterate through all the links on the page too. Each link object has the url etc which you could click on.

here is an example: Download all the links(related documents) on a webpage using Python

edited May 23, 2017 at 12:02

CommunityBot

11 silver badge

answered Jul 11, 2011 at 11:06

Rusty Rob

17.3k10 gold badges103 silver badges121 bronze badges

Comments

cerberos · Accepted Answer · 2011-07-11 12:58:21Z

3

Here's an example of a very fast (concurrent) recursive web scraper using eventlet. It only prints the urls it finds but you can modify it to do what you want. Perhaps you'd want to parse the html with lxml (fast), pyquery (slower but still fast) or BeautifulSoup (slow) to get the data you want.

answered Jul 11, 2011 at 12:58

cerberos

8,0835 gold badges44 silver badges45 bronze badges

Comments

Community · Accepted Answer · 2017-05-23 10:33:51Z

2

Have a look at scrapy (and related questions). As for performance... very difficult to make any useful suggestions without seeing the code.

edited May 23, 2017 at 10:33

CommunityBot

11 silver badge

answered Jul 11, 2011 at 10:55

Rob Cowie

22.6k6 gold badges65 silver badges58 bronze badges

Collectives™ on Stack Overflow

Crawler with Python?

4 Answers 4

3 Comments

Comments

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

3 Comments

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related