I am attempting to use BeautifulSoup to parse a DOM tree and extract the names of authors. Below is a snippet of HTML showing the structure of the page I'm going to scrape.

<html>
<body>
<div class="list-authors">
<span class="descriptor">Authors:</span> 
<a href="/find/astro-ph/1/au:+Lin_D/0/1/0/all/0/1">Dacheng Lin</a>, 
<a href="/find/astro-ph/1/au:+Remillard_R/0/1/0/all/0/1">Ronald A. Remillard</a>, 
<a href="/find/astro-ph/1/au:+Homan_J/0/1/0/all/0/1">Jeroen Homan</a> 
</div>
<div class="list-authors">
<span class="descriptor">Authors:</span> 
<a href="/find/astro-ph/1/au:+Kosovichev_A/0/1/0/all/0/1">A.G. Kosovichev</a>
</div>

<!--There are many other div tags with this structure-->
</body>
</html>

My point of confusion is that soup.find returns only the first occurrence of the div tag I'm searching for. Within that div, I then search for all 'a' link tags. At this stage, how do I extract the authors' names from each of the link tags and print them out? Is there a way to do it using BeautifulSoup, or do I need to use regex? And how do I continue iterating over every other div tag to extract the authors' names?

import re
import urllib2, sys
from BeautifulSoup import BeautifulSoup, NavigableString

html = urllib2.urlopen(address).read()
soup = BeautifulSoup(html)

try:
    # soup.find returns only the first matching div
    authordiv = soup.find('div', attrs={'class': 'list-authors'})
    links = authordiv.findAll('a')

    for link in links:
        print ''.join(link[0].contents)

    # Iterate through entire page and print authors

except IOError:
    print 'IO error'

2 Answers

13

Just use findAll for the divs, like you do for the links:

for authordiv in soup.findAll('div', attrs={'class': 'list-authors'}):
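
The loop above has no body, so here is a fuller sketch of the same idea. It assumes Python 3 and the newer bs4 package (where findAll is spelled find_all) rather than the BeautifulSoup 3 import used in the question. The html string is a trimmed copy of the question's sample, and the author_names list is a name introduced here for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample markup from the question.
html = """
<div class="list-authors">
<span class="descriptor">Authors:</span>
<a href="/find/astro-ph/1/au:+Lin_D/0/1/0/all/0/1">Dacheng Lin</a>,
<a href="/find/astro-ph/1/au:+Remillard_R/0/1/0/all/0/1">Ronald A. Remillard</a>,
<a href="/find/astro-ph/1/au:+Homan_J/0/1/0/all/0/1">Jeroen Homan</a>
</div>
<div class="list-authors">
<span class="descriptor">Authors:</span>
<a href="/find/astro-ph/1/au:+Kosovichev_A/0/1/0/all/0/1">A.G. Kosovichev</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

author_names = []
# Outer loop: every matching div; inner loop: every link inside that div.
for authordiv in soup.find_all('div', attrs={'class': 'list-authors'}):
    for link in authordiv.find_all('a'):
        author_names.append(link.get_text())

for name in author_names:
    print(name)
```

No regex is needed here: get_text() returns the text between the opening and closing a tags.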

1

Since link is already taken from an iterable, you don't need to subindex link -- you can just do link.contents[0].

Running print link.contents[0] on your new example, which has two separate <div class="list-authors"> blocks, yields:

Dacheng Lin
Ronald A. Remillard
Jeroen Homan
A.G. Kosovichev
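
To illustrate why link[0] fails while link.contents[0] works, here is a small sketch. It assumes Python 3 and the bs4 package rather than the BeautifulSoup 3 import from the question. The variable names index_error, first_child and text are introduced here for illustration.

```python
from bs4 import BeautifulSoup

# A single author link taken from the question's sample markup.
soup = BeautifulSoup(
    '<a href="/find/astro-ph/1/au:+Lin_D/0/1/0/all/0/1">Dacheng Lin</a>',
    "html.parser",
)
link = soup.find('a')

# Indexing a Tag looks up an HTML attribute (like link['href']),
# so link[0] raises KeyError instead of returning a child node.
try:
    link[0]
    index_error = None
except KeyError as exc:
    index_error = exc

first_child = link.contents[0]  # first child node of the tag, a text node here
text = link.get_text()          # all text inside the tag, joined together

print(repr(index_error))
print(first_child)
print(text)
```

For a link that contains only plain text, contents[0] and get_text() give the same string; get_text() may be the safer choice if a link could contain nested tags.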

So I'm not sure I understand the comment about searching other divs. If they have different classes, you will need either a separate soup.find and soup.findAll for each class, or a modified version of your first soup.find.

2 Comments

And if there are more div tags, how do I iterate over those ones?
If you search by CSS class with find_all, you get a list of elements that you can iterate over with a for loop (see: crummy.com/software/BeautifulSoup/bs4/doc/…). Do something like: authordivs = soup.find_all('div', class_='list-authors').
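
As a rough sketch of the difference between find and find_all when searching by class, assuming Python 3 and bs4 (the shortened html string and the names first_div, all_divs and authors_per_div are made up for this example):

```python
from bs4 import BeautifulSoup

# Shortened version of the question's markup: two author divs.
html = """
<div class="list-authors">
<a href="/find/astro-ph/1/au:+Lin_D/0/1/0/all/0/1">Dacheng Lin</a>,
<a href="/find/astro-ph/1/au:+Homan_J/0/1/0/all/0/1">Jeroen Homan</a>
</div>
<div class="list-authors">
<a href="/find/astro-ph/1/au:+Kosovichev_A/0/1/0/all/0/1">A.G. Kosovichev</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

first_div = soup.find('div', class_='list-authors')     # one Tag, or None if absent
all_divs = soup.find_all('div', class_='list-authors')  # list-like, one entry per div

# Build one list of author names per div.
authors_per_div = [
    [link.get_text() for link in div.find_all('a')]
    for div in all_divs
]
print(authors_per_div)
```

Keeping one inner list per div preserves which authors belong together, which a flat list of names would lose.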
