I'd like to write a crawler using python. This means: I've got the url of some websites' home page, and I'd like my program to crawl through all the website following links that stay into the website. How can I do this easily and FAST? I tried BeautifulSoup already, but it is really cpu consuming and quite slow on my pc.
4 Answers
I'd recommend using mechanize in combination with lxml.html. as robert king suggested, mechanize is probably best for navigating through the site. for extracting elements I'd use lxml. lxml is much faster than BeautifulSoup and probably the fastest parser available for python. this link shows a performance test of different html parsers for python. Personally I'd refrain from using the scrapy wrapper.
I haven't tested it, but this is probably what youre looking for, first part is taken straight from the mechanize documentation. the lxml documentation is also quite helpful. especially take a look at this and this section.
import mechanize
import lxml.html
br = mechanize.Browser()
response = br.open("somewebsite")
for link in br.links():
print link
br.follow_link(link) # takes EITHER Link instance OR keyword args
print br
br.back()
# you can also display the links with lxml
html = response.read()
root = lxml.html.fromstring(html)
for link in root.iterlinks():
print link
you can also get elements via root.xpath(). A simple wget might even be the easiest solution.
Hope I could be helpful.
3 Comments
for loop is doing what you intend.I like using mechanize. Its fairly simple, you download it and create a browser object. With this object you can open a URL. You can use "back" and "forward" functions as in a normal browser. You can iterate through the forms on the page and fill them out if need be. You can iterate through all the links on the page too. Each link object has the url etc which you could click on.
here is an example: Download all the links(related documents) on a webpage using Python
Comments
Here's an example of a very fast (concurrent) recursive web scraper using eventlet. It only prints the urls it finds but you can modify it to do what you want. Perhaps you'd want to parse the html with lxml (fast), pyquery (slower but still fast) or BeautifulSoup (slow) to get the data you want.
Comments
Have a look at scrapy (and related questions). As for performance... very difficult to make any useful suggestions without seeing the code.