
I'm trying to write a basic web crawler in Python. The trouble I have is parsing the page to extract URLs. I've tried both BeautifulSoup and regex, but I cannot achieve an efficient solution.

As an example, I'm trying to extract all the member URLs from Facebook's GitHub page (https://github.com/facebook?tab=members). The code I've written extracts the member URLs:

import urllib2
from bs4 import BeautifulSoup

def getMembers(url):
    # Retrieve every member listed on the organization's members page,
    # e.g. url = "https://github.com/facebook?tab=members"
    text = urllib2.urlopen(url).read()
    soup = BeautifulSoup(text)
    memberList = []

    data = soup.findAll('ul', attrs={'class': 'members-list'})
    for ul in data:
        links = ul.findAll('li')
        for link in links:
            memberList.append("https://github.com" + str(link.a['href']))

    return memberList

However, this takes quite a while to parse, and I was wondering if I could do it more efficiently, since the crawling process is too slow.

4 Comments
  • Have you tried using a different parser? You can use the lxml parser with Beautiful Soup, making it quite quick. Commented Nov 6, 2012 at 22:23
  • @kreativitea I'm checking it right now. Thanks a lot for the help! Commented Nov 6, 2012 at 22:23
  • Are you sure this is not your internet connection? Processing itself should be quick. My suggestion: write your output to a file, and check how long that takes. Commented Nov 6, 2012 at 22:40
  • Measure separately how long it takes to get the text (urllib2) and to find the links in it (BeautifulSoup). You could use timeit.default_timer() or run python -m cProfile your_script.py; GitHub might be responding slowly. Commented Nov 7, 2012 at 0:51 (a sketch of this measurement follows below)
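A rough sketch of that split measurement (the URL and the lxml parser choice are assumptions taken from the question and the first comment, not from the original posts):

import timeit
import urllib2
from bs4 import BeautifulSoup

url = "https://github.com/facebook?tab=members"

start = timeit.default_timer()
text = urllib2.urlopen(url).read()                  # network fetch
fetched = timeit.default_timer()

soup = BeautifulSoup(text, "lxml")                  # parse with the lxml backend
data = soup.findAll('ul', attrs={'class': 'members-list'})
parsed = timeit.default_timer()

print "fetch: %.2fs, parse: %.2fs" % (fetched - start, parsed - fetched)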

3 Answers


I suggest that you use the GitHub API, which lets you do exactly what you want to accomplish. Then it's only a matter of using a JSON parser and you are done.

http://developer.github.com/v3/orgs/members/
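A minimal sketch of that approach, assuming the unauthenticated v3 members endpoint (which is rate-limited) and that each returned member object carries an html_url field:

import json
import urllib2

def getMembers(org):
    # Ask the GitHub API for the organization's public members.
    url = "https://api.github.com/orgs/%s/members" % org
    members = json.loads(urllib2.urlopen(url).read())
    # Each member object includes the profile URL directly.
    return [member["html_url"] for member in members]

print getMembers("facebook")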




To avoid writing the scraper yourself, you can use an existing one. Maybe try Scrapy; it is written in Python and is available on GitHub. http://scrapy.org/
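A minimal spider along those lines might look like the sketch below (it uses the current Scrapy spider API; the spider name and CSS selector are my own assumptions, not from this answer):

import scrapy

class MembersSpider(scrapy.Spider):
    # Hypothetical spider for the members page in the question.
    name = "members"
    start_urls = ["https://github.com/facebook?tab=members"]

    def parse(self, response):
        # Yield the absolute profile URL for every member entry.
        for href in response.css("ul.members-list li a::attr(href)").getall():
            yield {"member_url": response.urljoin(href)}

You would run it with scrapy runspider members_spider.py -o members.json and let Scrapy handle the request scheduling and retries.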



Check out the post Extremely Simple Web Crawler for a simple and easy-to-understand Python script that crawls web pages and collects all the valid hyperlinks depending on the seed URL and depth.
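The idea boils down to a breadth-first crawl bounded by depth; here is a rough sketch of that pattern (my own illustration, not the script from that post):

import urllib2
from collections import deque
from urlparse import urljoin
from bs4 import BeautifulSoup

def crawl(seed_url, max_depth=2):
    # Breadth-first crawl: follow links up to max_depth hops from the
    # seed and collect every hyperlink encountered along the way.
    seen = set([seed_url])
    queue = deque([(seed_url, 0)])
    links = []
    while queue:
        url, depth = queue.popleft()
        try:
            html = urllib2.urlopen(url).read()
        except IOError:
            continue
        soup = BeautifulSoup(html)
        for a in soup.findAll('a', href=True):
            link = urljoin(url, a['href'])
            links.append(link)
            if depth + 1 <= max_depth and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return links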

