
I'm trying to write a basic web crawler in Python. The trouble I have is parsing the page to extract URLs. I've tried both BeautifulSoup and regex, but I cannot achieve an efficient solution.

As an example, I'm trying to extract all the member URLs from Facebook's GitHub page (https://github.com/facebook?tab=members). The code I've written extracts the member URLs:

import urllib2
from bs4 import BeautifulSoup

def getMembers(url):
    # Retrieve every member listed on the organization's members page,
    # e.g. url = "https://github.com/facebook?tab=members"
    text = urllib2.urlopen(url).read()
    soup = BeautifulSoup(text)
    memberList = []

    data = soup.findAll('ul', attrs={'class': 'members-list'})
    for ul in data:
        links = ul.findAll('li')
        for link in links:
            memberList.append("https://github.com" + str(link.a['href']))

    return memberList

However, this takes quite a while to parse, and I was wondering if I could do it more efficiently, since the crawling process is too slow.

4 Comments
  • Have you tried using a different parser? You can use the lxml parser with Beautiful Soup, making it quite quick. Commented Nov 6, 2012 at 22:23
  • @kreativitea I'm checking it right now. Thanks a lot for the help! Commented Nov 6, 2012 at 22:23
  • Are you sure this is not your internet connection? Processing itself should be quick. My suggestion: write your output to a file, and check how long that takes. Commented Nov 6, 2012 at 22:40
  • Measure separately how long it takes to get the text (urllib2) and to find the links in it (BeautifulSoup). You could use timeit.default_timer() or run python -m cProfile your_script.py; GitHub might be responding slowly. Commented Nov 7, 2012 at 0:51 (a sketch of this measurement follows below)
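A rough sketch of that split measurement (the URL and the lxml parser choice are assumptions taken from the question and the first comment, not from the original posts):

import timeit
import urllib2
from bs4 import BeautifulSoup

url = "https://github.com/facebook?tab=members"

start = timeit.default_timer()
text = urllib2.urlopen(url).read()                  # network fetch
fetched = timeit.default_timer()

soup = BeautifulSoup(text, "lxml")                  # parse with the lxml backend
data = soup.findAll('ul', attrs={'class': 'members-list'})
parsed = timeit.default_timer()

print "fetch: %.2fs, parse: %.2fs" % (fetched - start, parsed - fetched)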

3 Answers


I suggest that you use the GitHub API, which lets you do exactly what you want to accomplish. Then it's only a matter of using a JSON parser and you are done.

http://developer.github.com/v3/orgs/members/
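A minimal sketch of that approach, assuming the unauthenticated v3 members endpoint (which is rate-limited) and that each returned member object carries an html_url field:

import json
import urllib2

def getMembers(org):
    # Ask the GitHub API for the organization's public members.
    url = "https://api.github.com/orgs/%s/members" % org
    members = json.loads(urllib2.urlopen(url).read())
    # Each member object includes the profile URL directly.
    return [member["html_url"] for member in members]

print getMembers("facebook")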




To avoid writing the scraper yourself, you can use an existing one. Maybe try Scrapy; it is written in Python and is available on GitHub. http://scrapy.org/
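A minimal spider along those lines might look like the sketch below (it uses the current Scrapy spider API; the spider name and CSS selector are my own assumptions, not from this answer):

import scrapy

class MembersSpider(scrapy.Spider):
    # Hypothetical spider for the members page in the question.
    name = "members"
    start_urls = ["https://github.com/facebook?tab=members"]

    def parse(self, response):
        # Yield the absolute profile URL for every member entry.
        for href in response.css("ul.members-list li a::attr(href)").getall():
            yield {"member_url": response.urljoin(href)}

You would run it with scrapy runspider members_spider.py -o members.json and let Scrapy handle the request scheduling and retries.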



Check out the post Extremely Simple Web Crawler for a simple and easy-to-understand Python script that crawls web pages and collects all the valid hyperlinks depending on the seed URL and depth.
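The idea boils down to a breadth-first crawl bounded by depth; here is a rough sketch of that pattern (my own illustration, not the script from that post):

import urllib2
from collections import deque
from urlparse import urljoin
from bs4 import BeautifulSoup

def crawl(seed_url, max_depth=2):
    # Breadth-first crawl: follow links up to max_depth hops from the
    # seed and collect every hyperlink encountered along the way.
    seen = set([seed_url])
    queue = deque([(seed_url, 0)])
    links = []
    while queue:
        url, depth = queue.popleft()
        try:
            html = urllib2.urlopen(url).read()
        except IOError:
            continue
        soup = BeautifulSoup(html)
        for a in soup.findAll('a', href=True):
            link = urljoin(url, a['href'])
            links.append(link)
            if depth + 1 <= max_depth and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return links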

