Web scraping data using Python?

Question

I just started learning web scraping using Python. However, I've already run into some problems.

My goal is to web scrape the names of the different tuna species from fishbase.org (http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=salmon)

The problem: I'm unable to extract all of the species names.

This is what I have so far:

import urllib2
from bs4 import BeautifulSoup

fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Tuna'
page = urllib2.urlopen(fish_url)
html_doc = page.read()

soup = BeautifulSoup(html_doc)

spans = soup.find_all(

From here, I don't know how to go about extracting the species names. I've thought of using a regex (e.g. soup.find_all("a", text=re.compile("\d+\s+\d+"))) to capture the text inside the tags...

Any input will be highly appreciated!

joe · Accepted Answer · 2012-03-05 08:35:30Z

4

You might as well take advantage of the fact that all the scientific names (and only scientific names) are in <i> tags:

scientific_names = [it.text for it in soup.table.find_all('i')]

Using BS and RegEx are two different approaches to parsing a webpage. The former exists so you don't have to bother so much with the latter.

You should read up on what BS actually does; it seems like you're underestimating its utility.
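
A minimal sketch of the same idea on an inline HTML fragment, so it runs without network access. The sample markup below is made up for illustration; it only assumes, as the answer does, that the scientific names sit in <i> tags inside the first table. Passing "html.parser" explicitly and using Python 3 are choices made for this sketch, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the FishBase result page (not the real markup).
SAMPLE_HTML = """
<html><body>
<p>Results for <i>tuna</i></p>
<table>
  <tr><td>Yellowfin tuna</td><td><i>Thunnus albacares</i></td></tr>
  <tr><td>Albacore</td><td><i>Thunnus alalunga</i></td></tr>
</table>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# soup.table is the first <table>; find_all("i") searches only inside it,
# so the <i> in the paragraph above the table is skipped.
scientific_names = [tag.get_text(strip=True) for tag in soup.table.find_all("i")]

print(scientific_names)
```

Scoping the search to soup.table is what keeps italic text elsewhere on the page out of the result.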

edited Mar 5, 2012 at 8:35

answered Mar 5, 2012 at 8:20

joe

Community · Accepted Answer · 2017-05-23 12:22:19Z

4

What jozek suggests is the correct approach, but I couldn't get his snippet to work (perhaps because I am not running the BeautifulSoup 4 beta). What worked for me was:

import urllib2
from BeautifulSoup import BeautifulSoup

fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Tuna'
page = urllib2.urlopen(fish_url)

soup = BeautifulSoup(page)

scientific_names = [it.text for it in soup.table.findAll('i')]

print scientific_names

edited May 23, 2017 at 12:22

CommunityBot

answered Mar 5, 2012 at 9:09

BioGeek

1 Comment

jcollado Over a year ago

Indeed, findAll has been renamed to find_all to be PEP 8 compliant. More information here.

jcollado · Accepted Answer · 2012-03-05 07:25:47Z

2

Looking at the web page, I'm not sure exactly what information you want to extract. However, note that you can easily get the text in a tag using the text attribute:

>>> from bs4 import BeautifulSoup
>>> html = '<a>some text</a>'
>>> soup = BeautifulSoup(html)
>>> [tag.text for tag in soup.find_all('a')]
[u'some text']
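
As a small extension of the session above, here is a sketch of how the text attribute behaves when a tag contains nested tags. The HTML string is invented for illustration, and the example is written for Python 3 with an explicit "html.parser".

```python
from bs4 import BeautifulSoup

# Invented fragment: an <a> tag with a nested <b> tag inside it.
html = "<a>some <b>bold</b> text</a>"
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")

# .text joins every string found anywhere inside the tag.
joined_text = link.text

# .string is only set when the tag has exactly one child; here it is None.
single_string = link.string

print(joined_text)
print(single_string)
```

This is why .text is usually the safer choice when the markup inside a cell is not known in advance.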

answered Mar 5, 2012 at 7:25

jcollado

user1248092 · Accepted Answer · 2012-03-05 19:02:41Z

1

Thanks, everyone. I was able to solve the problem I was having with this code:

import urllib2
from bs4 import BeautifulSoup

fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Salmon'
page = urllib2.urlopen(fish_url)
html_doc = page.read()
soup = BeautifulSoup(html_doc)

scientific_names = [it.text for it in soup.table.find_all('i')]

for item in scientific_names:
    print item
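
The code above targets Python 2 (urllib2, the print statement). A possible Python 3 port is sketched below. The function names extract_scientific_names and fetch_scientific_names are hypothetical helpers introduced here, and the sketch assumes the page still keeps the names in <i> tags inside its first table, which may no longer hold for the live site.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup

FISH_URL = ("http://www.fishbase.org/ComNames/CommonNameSearchList.php"
            "?CommonName=Salmon")


def extract_scientific_names(html_doc):
    """Return the text of every <i> tag inside the first <table>."""
    soup = BeautifulSoup(html_doc, "html.parser")
    if soup.table is None:  # no table found, e.g. the layout changed
        return []
    return [tag.get_text(strip=True) for tag in soup.table.find_all("i")]


def fetch_scientific_names(url=FISH_URL):
    """Download the page and parse it (needs network access)."""
    with urlopen(url) as page:
        return extract_scientific_names(page.read())


# Offline check on a tiny made-up fragment (no request is sent here).
sample = "<table><tr><td><i>Salmo salar</i></td></tr></table>"
print(extract_scientific_names(sample))
```

To run it against the live site, call fetch_scientific_names() and print each item. Keeping the parsing in its own function means it can be checked without a network connection.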

answered Mar 5, 2012 at 19:02

user1248092

2 Comments

BioGeek Over a year ago

Don't forget to accept the answer that helped you most as the correct answer.

CLaFarge Over a year ago

... so it would be appropriate to mark Joe's answer as the correct one... this helps to keep people from jumping in to answer thinking no-one's worked it out for you.

warvariuc · Accepted Answer · 2012-03-05 07:56:21Z

0

If you want a long-term solution, try Scrapy. It is quite simple and does a lot of work for you. It is very customizable and extensible. You will extract all the URLs you need using XPath, which is more pleasant and reliable. Scrapy still lets you use re if you need it.
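
To give a feel for XPath-style selection without installing anything, here is a sketch that uses the standard library's xml.etree.ElementTree in place of Scrapy. This is not Scrapy's API: ElementTree supports only a limited XPath subset and needs well-formed markup, which real pages often are not. The sample markup is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed stand-in for a result page.
SAMPLE = """
<html><body>
<p><i>not a species</i></p>
<table>
  <tr><td>Atlantic salmon</td><td><i>Salmo salar</i></td></tr>
  <tr><td>Chinook salmon</td><td><i>Oncorhynchus tshawytscha</i></td></tr>
</table>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# ".//table//i" selects every <i> element at any depth below any <table>.
names = [node.text for node in root.findall(".//table//i")]

print(names)
```

In a Scrapy spider, a similar expression would typically be passed to the response's selector, for example response.xpath("//table//i/text()").getall().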

answered Mar 5, 2012 at 7:56

warvariuc
