5

I just started learning web scraping using Python. However, I've already ran into some problems.

My goal is to web scrape the names of the different tuna species from fishbase.org (http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=salmon)

The problem: I'm unable to extract all of the species names.

This is what I have so far:

import urllib2
from bs4 import BeautifulSoup

fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Tuna'
page = urllib2.urlopen(fish_url)

soup = BeautifulSoup(html_doc)

spans = soup.find_all(

From here, I don't know how I would go about extracting the species names. I've thought of using regex (i.e. soup.find_all("a", text=re.compile("\d+\s+\d+")) to capture the texts inside the tag...

Any input will be highly appreciated!

5 Answers 5

4

You might as well take advantage of the fact that all the scientific names (and only scientific names) are in <i/> tags:

scientific_names = [it.text for it in soup.table.find_all('i')]

Using BS and RegEx are two different approaches to parsing a webpage. The former exists so you don't have to bother so much with the latter.

You should read up on what BS actually does, it seems like you're underestimating its utility.

Sign up to request clarification or add additional context in comments.

Comments

4

What jozek suggests is the correct approach, but I couldn't get his snippet to work (but that's maybe because I am not running the BeautifulSoup 4 beta). What worked for me was:

import urllib2
from BeautifulSoup import BeautifulSoup

fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Tuna'
page = urllib2.urlopen(fish_url)

soup = BeautifulSoup(page)

scientific_names = [it.text for it in soup.table.findAll('i')]

print scientific_names

1 Comment

Indeed findAll has been renamed to find_all to be pep8 compliant. More information here.
2

Looking at the web page, I'm not sure exactly about what information you want to extract. However, note that you can easily get the text in a tag using the text attribute:

>>> from bs4 import BeautifulSoup
>>> html = '<a>some text</a>'
>>> soup = BeautifulSoup(html)
>>> [tag.text for tag in soup.find_all('a')]
[u'some text']

Comments

1

Thanks everyone...I was able to solve the problem I was having with this code:

import urllib2
from bs4 import BeautifulSoup

fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Salmon'
page = urllib2.urlopen(fish_url)
html_doc = page.read()
soup = BeautifulSoup(html_doc)

scientific_names = [it.text for it in soup.table.find_all('i')]

for item in scientific_names:
print item

2 Comments

Don't forget to accept the answer that helped you most as the correct answer.
... so it would be appropriate to marks Joe's Answer as the correct one... this helps to keep people from jumping in to answer thinking no-one's worked it out for you.
0

If you want a long term solution, try scrapy. It is quite simple and does a lot of work for you. It is very customizable and extensible. You will extract all the URLs you need using xpath, which is more pleasant and reliable. Still scrapy allows you to use re, if you need.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.