Using BeautifulSoup with Python to parse page for attribute values

Question

I am trying to use Python with BeautifulSoup to go through a page that has sections whose ids increment by 1, and I want to get the vids that belong to each section. However, the number of vids varies per span id, as you can see below, and the vids are not nested under the tr that contains the span.

Right now I am looping to get each span id value, but I am trying to figure out a way to get the vid values as an array for each span id.

The following is an example of the HTML I am working with:

<tr>
    <td>
        <div>
            <span class="apple-font" id="001">
        </div>
    </td>
</tr>

<tr>
</tr>

<tr>
    <td>
        <a vid="0099882"></a>
    </td>
</tr>

<tr>
    <td>
        <a vid="0099883"></a>
    </td>
</tr>

<tr>
    <td>
        <a vid="0099883"></a>
    </td>
</tr>


<tr>
    <td>
        <div>
            <span class="apple-font" id="002">
        </div>
    </td>
</tr>

<tr>
</tr>

<tr>
    <td>
        <a vid="0099883"></a>
    </td>
</tr>

<tr>
    <td>
        <div>
            <span class="apple-font" id="003">
        </div>
    </td>
</tr>

<tr>
</tr>

<tr>
    <td>
        <a vid="0099883"></a>
    </td>
</tr>

<tr>
    <td>
        <a vid="0099883"></a>
    </td>
</tr>

<tr>
    <td>
        <div>
            <span class="apple-font" id="004">
        </div>
    </td>
</tr>

<tr>
</tr>

The following is the code I have been trying, but I have not made much progress on getting all the vids:

# keep the parsed document in `soup`; store the matches under a separate name
spans = soup.find_all(class_="apple-font", id=True)
for s in spans:
    n = s.get_text().lstrip().replace(".", "")
    print(n)
print()

Are these all in the same table?

– Martijn Pieters, Commented Mar 9, 2015 at 16:40
Yes, they are all in the same table, thanks!

– user1982011, Commented Mar 9, 2015 at 16:41

Martijn Pieters · Accepted Answer · 2015-03-09 16:52:40Z

I'd use an iterative approach: find the table that contains the first <span class="apple-font"> tag with an id, loop over all tr elements in that table, and start a new group each time you find a row with a new id. Rows that contain an <a vid="..."> link are added to the current group:

table = soup.find(class_='apple-font', id=True).find_parent('table')
groups = {}
group = None
for tr in table.find_all('tr'):
    id_span = tr.find(class_='apple-font', id=True)
    if id_span is not None:
        # new group
        group = []
        groups[id_span['id']] = group
    else:
        vid_link = tr.find('a', vid=True)
        if vid_link is not None:
            group.append(vid_link['vid'])

Demo:

>>> from bs4 import BeautifulSoup
>>> sample = '''\
... <tr>
...     <td>
...         <div>
...             <span class="apple-font" id="001">
...         </div>
...     </td>
... </tr>
... 
... <tr>
... </tr>
... 
... <tr>
...     <td>
...         <a vid="0099882"></a>
...     </td>
... </tr>
... 
... <tr>
...     <td>
...         <a vid="0099883"></a>
...     </td>
... </tr>
... 
... <tr>
...     <td>
...         <a vid="0099883"></a>
...     </td>
... </tr>
... 
... 
... <tr>
...     <td>
...         <div>
...             <span class="apple-font" id="002">
...         </div>
...     </td>
... </tr>
... 
... <tr>
... </tr>
... 
... <tr>
...     <td>
...         <a vid="0099883"></a>
...     </td>
... </tr>
... 
... <tr>
...     <td>
...         <div>
...             <span class="apple-font" id="003">
...         </div>
...     </td>
... </tr>
... 
... <tr>
... </tr>
... 
... <tr>
...     <td>
...         <a vid="0099883"></a>
...     </td>
... </tr>
... 
... <tr>
...     <td>
...         <a vid="0099883"></a>
...     </td>
... </tr>
... 
... <tr>
...     <td>
...         <div>
...             <span class="apple-font" id="004">
...         </div>
...     </td>
... </tr>
... 
... <tr>
... </tr>
... '''
>>> soup = BeautifulSoup('<table>{}</table>'.format(sample), 'html.parser')
>>> table = soup.find(class_='apple-font', id=True).find_parent('table')
>>> groups = {}
>>> group = None
>>> for tr in table.find_all('tr'):
...     id_span = tr.find(class_='apple-font', id=True)
...     if id_span is not None:
...         # new group
...         group = []
...         groups[id_span['id']] = group
...     else:
...         vid_link = tr.find('a', vid=True)
...         if vid_link is not None:
...             group.append(vid_link['vid'])
... 
>>> print(groups)
{'001': ['0099882', '0099883', '0099883'], '002': ['0099883'], '003': ['0099883', '0099883'], '004': []}

edited Mar 9, 2015 at 16:52

answered Mar 9, 2015 at 16:44

Martijn Pieters

4 Comments

user1982011 Over a year ago

Thanks a lot for this, it works well. One question: I have some other trs and tables above it with additional information, and as a result I am getting the error "AttributeError: 'NoneType' object has no attribute 'append'". Is there some way it can look only at these specific trs, i.e. the items with id=00..? Or is it already doing that? When I delete all the trs / tables above it, it works fine.

Martijn Pieters Over a year ago

@user1982011: that means you are seeing <a vid="..."> links before the first <span class="apple-font" id="..."> row shows up. Rather than setting groups = {} and group = None, you can use group = [] and groups = {'no id': group} to collect all of those in a separate list keyed on 'no id'.
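
As a rough sketch of that suggestion, the loop could start with a catch-all list so that vid rows seen before the first id row have somewhere to go. The sample HTML and the vid 0000001 are made up for illustration, and 'html.parser' is an assumed parser choice:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one vid row appears before the first id row.
html = '''
<table>
  <tr><td><a vid="0000001"></a></td></tr>
  <tr><td><div><span class="apple-font" id="001"></span></div></td></tr>
  <tr><td><a vid="0099882"></a></td></tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')
table = soup.find(class_='apple-font', id=True).find_parent('table')

# Start with a catch-all list instead of None, keyed on 'no id'.
group = []
groups = {'no id': group}

for tr in table.find_all('tr'):
    id_span = tr.find(class_='apple-font', id=True)
    if id_span is not None:
        # new group; later vid rows are appended to this list
        group = []
        groups[id_span['id']] = group
    else:
        vid_link = tr.find('a', vid=True)
        if vid_link is not None:
            group.append(vid_link['vid'])

print(groups)
```

With this sample, the stray vid should end up under 'no id' and the remaining one under '001', and no AttributeError is raised.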

user1982011 Over a year ago

Thank you, this works perfectly! One last question: if there are multiple <a vid="#####"></a> links within a td, is there a way to pick them all up and include them in the respective group? It was an edge case that came up.

Martijn Pieters Over a year ago

@user1982011: use group.extend(link['vid'] for link in tr.find_all('a', vid=True)) to add them all; that line becomes the whole else block.
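
A minimal sketch of the loop with that else block swapped in might look like the following. The sample HTML (including the vid 0099884) is invented for illustration, 'html.parser' is an assumed parser choice, and the sketch assumes the first row of the table carries an id so that group is never None when extend is called:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second row holds two vid links in one cell.
html = '''
<table>
  <tr><td><div><span class="apple-font" id="001"></span></div></td></tr>
  <tr><td><a vid="0099882"></a><a vid="0099883"></a></td></tr>
  <tr><td><a vid="0099884"></a></td></tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')
table = soup.find(class_='apple-font', id=True).find_parent('table')

groups = {}
group = None
for tr in table.find_all('tr'):
    id_span = tr.find(class_='apple-font', id=True)
    if id_span is not None:
        # new group
        group = []
        groups[id_span['id']] = group
    else:
        # find_all returns every matching link in the row (possibly none),
        # so extend adds zero or more vids in document order
        group.extend(link['vid'] for link in tr.find_all('a', vid=True))

print(groups)
```

Because find_all returns an empty list for rows without links, the separate "is not None" check from the single-link version is no longer needed.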
