2

I'd like to write a crawler in Python. This means: I have the URL of a website's home page, and I'd like my program to crawl the whole site by following links that stay within that site. How can I do this easily and FAST? I tried BeautifulSoup already, but it is really CPU-consuming and quite slow on my PC.

1
  • Often discussed... look at SO questions dealing with "scrapy" Commented Jul 11, 2011 at 10:55

4 Answers

6

I'd recommend using mechanize in combination with lxml.html. As robert king suggested, mechanize is probably best for navigating through the site. For extracting elements I'd use lxml, which is much faster than BeautifulSoup and probably the fastest parser available for Python; this link shows a performance test of different HTML parsers for Python. Personally, I'd refrain from using the scrapy wrapper.

I haven't tested it, but this is probably what you're looking for; the first part is taken straight from the mechanize documentation. The lxml documentation is also quite helpful. In particular, take a look at this and this section.

import mechanize
import lxml.html

br = mechanize.Browser()
response = br.open("somewebsite")

# follow every link on the page, printing it, then step back to the start page
for link in br.links():
    print(link)
    br.follow_link(link)  # takes EITHER Link instance OR keyword args
    print(br)
    br.back()

# you can also display the links with lxml; iterlinks() yields
# (element, attribute, link, pos) tuples
html = response.read()
root = lxml.html.fromstring(html)
for link in root.iterlinks():
    print(link)

You can also get elements via root.xpath(). A simple wget might even be the easiest solution.
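For example, with the root parsed above, a couple of XPath queries might look like this (the selectors are only illustrative):

hrefs = root.xpath("//a/@href")         # every link target on the page
headings = root.xpath("//h1//text()")   # text content of all h1 headings
print(hrefs[:10])
print(headings)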

Hope this helps.


3 Comments

I don't think the first for loop is doing what you intend.
I have only used mechanize for navigating through forms so far. Isn't it simply printing each link on the site and then following it? br.back might be unnecessary, but it seems to me like it's doing exactly that. The posts here are similar. But since I am pretty new to Python/programming in general, please correct me if I am wrong.
Anyway, I started using lxml and everything got faster! Thank you!
3

I like using mechanize. It's fairly simple: you download it and create a browser object. With this object you can open a URL. You can use "back" and "forward" functions as in a normal browser. You can iterate through the forms on the page and fill them out if need be. You can also iterate through all the links on the page; each link object has the url etc. that you could click on.

Here is an example: Download all the links (related documents) on a webpage using Python
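As a rough, untested sketch of that workflow (the URL and the form field name are just placeholders), it might look like this:

import mechanize

br = mechanize.Browser()
br.open("http://example.com/")   # placeholder URL

# walk the links on the page; each Link object carries its url and text
for link in br.links():
    print(link.url, link.text)

# fill out and submit the first form on the page, then step back
br.select_form(nr=0)             # assumes the page has at least one form
br["q"] = "some query"           # "q" is a placeholder field name
response = br.submit()
print(response.geturl())
br.back()                        # back to the page the form was on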


3

Here's an example of a very fast (concurrent) recursive web scraper using eventlet. It only prints the URLs it finds, but you can modify it to do what you want. Perhaps you'd want to parse the HTML with lxml (fast), pyquery (slower but still fast) or BeautifulSoup (slow) to get the data you want.
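The linked example is not reproduced here, but the idea is roughly the sketch below: a GreenPool of green threads fetches pages concurrently, and each fetched page spawns new fetches for the in-site links it finds. This is only a sketch, assuming eventlet and lxml are installed; the start URL and pool size are placeholders.

import eventlet
eventlet.monkey_patch()          # make the socket module cooperative

import urllib.request
import lxml.html

START = "http://example.com/"    # placeholder start URL
pool = eventlet.GreenPool(20)    # at most 20 pages in flight at once
seen = set()

def crawl(url):
    if url in seen:
        return
    seen.add(url)
    print(url)
    try:
        html = urllib.request.urlopen(url).read()
    except Exception:
        return
    root = lxml.html.fromstring(html)
    root.make_links_absolute(url)
    for element, attribute, link, pos in root.iterlinks():
        if link.startswith(START):   # stay inside the site
            pool.spawn(crawl, link)

pool.spawn(crawl, START)
pool.waitall()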


2

Have a look at scrapy (and related questions). As for performance: it's very difficult to make any useful suggestions without seeing the code.
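For reference, a minimal single-site spider for a current Scrapy release looks roughly like the sketch below (the domain and start URL are placeholders); Scrapy takes care of the request queue and the concurrency.

import scrapy

class SiteSpider(scrapy.Spider):
    name = "site"
    allowed_domains = ["example.com"]     # placeholder: keeps the crawl on one site
    start_urls = ["http://example.com/"]

    def parse(self, response):
        self.logger.info(response.url)
        # follow every link on the page and parse it with this same callback;
        # off-site links are filtered out by allowed_domains
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

Saved as, say, site_spider.py, it can be run with scrapy runspider site_spider.py.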

