I am using Python with BeautifulSoup to parse a page whose sections are marked by span ids that increment by 1, and I want to collect the vids that belong to each id. However, the number of vids varies per span id, as you can see below, and the vid links are not nested under the tr that contains the span.

Right now I loop over the spans to get each span id value, but I am trying to figure out how to get the vid values as an array for each span id.

The following is an example of the HTML I am working with:

<tr>
    <td>
        <div>
            <span class="apple-font" id="001">
        </div>
    </td>
</tr>

<tr>
</tr>

<tr>
    <td>
        <a vid="0099882"></a>
    </td>
</tr>

<tr>
    <td>
        <a vid="0099883"></a>
    </td>
</tr>

<tr>
    <td>
        <a vid="0099883"></a>
    </td>
</tr>


<tr>
    <td>
        <div>
            <span class="apple-font" id="002">
        </div>
    </td>
</tr>

<tr>
</tr>

<tr>
    <td>
        <a vid="0099883"></a>
    </td>
</tr>

<tr>
    <td>
        <div>
            <span class="apple-font" id="003">
        </div>
    </td>
</tr>

<tr>
</tr>

<tr>
    <td>
        <a vid="0099883"></a>
    </td>
</tr>

<tr>
    <td>
        <a vid="0099883"></a>
    </td>
</tr>

<tr>
    <td>
        <div>
            <span class="apple-font" id="004">
        </div>
    </td>
</tr>

<tr>
</tr>

The following is the code I have been trying, but I have not made much progress on getting all the vids:

spans = soup.find_all(class_="apple-font", id=True)
for s in spans:
    n = str(s.get_text().lstrip().replace(".", ""))
    print(n)
print()
  • Are these all in the same table? Commented Mar 9, 2015 at 16:40
  • yes they are all in the same table, thanks! Commented Mar 9, 2015 at 16:41

1 Answer

I'd use an iterative approach: find the table that contains the first <span class="apple-font"> tag, loop over all tr elements in that table, and start a new group each time you find a row with a new id:

table = soup.find(class_='apple-font', id=True).find_parent('table')
groups = {}
group = None
for tr in table.find_all('tr'):
    id_span = tr.find(class_='apple-font', id=True)
    if id_span is not None:
        # new group
        group = []
        groups[id_span['id']] = group
    else:
        vid_link = tr.find('a', vid=True)
        if vid_link is not None:
            group.append(vid_link['vid'])

Demo:

>>> from bs4 import BeautifulSoup
>>> sample = '''\
... <tr>
...     <td>
...         <div>
...             <span class="apple-font" id="001">
...         </div>
...     </td>
... </tr>
... 
... <tr>
... </tr>
... 
... <tr>
...     <td>
...         <a vid="0099882"></a>
...     </td>
... </tr>
... 
... <tr>
...     <td>
...         <a vid="0099883"></a>
...     </td>
... </tr>
... 
... <tr>
...     <td>
...         <a vid="0099883"></a>
...     </td>
... </tr>
... 
... 
... <tr>
...     <td>
...         <div>
...             <span class="apple-font" id="002">
...         </div>
...     </td>
... </tr>
... 
... <tr>
... </tr>
... 
... <tr>
...     <td>
...         <a vid="0099883"></a>
...     </td>
... </tr>
... 
... <tr>
...     <td>
...         <div>
...             <span class="apple-font" id="003">
...         </div>
...     </td>
... </tr>
... 
... <tr>
... </tr>
... 
... <tr>
...     <td>
...         <a vid="0099883"></a>
...     </td>
... </tr>
... 
... <tr>
...     <td>
...         <a vid="0099883"></a>
...     </td>
... </tr>
... 
... <tr>
...     <td>
...         <div>
...             <span class="apple-font" id="004">
...         </div>
...     </td>
... </tr>
... 
... <tr>
... </tr>
... '''
>>> soup = BeautifulSoup('<table>{}</table>'.format(sample), 'html.parser')
>>> table = soup.find(class_='apple-font', id=True).find_parent('table')
>>> groups = {}
>>> group = None
>>> for tr in table.find_all('tr'):
...     id_span = tr.find(class_='apple-font', id=True)
...     if id_span is not None:
...         # new group
...         group = []
...         groups[id_span['id']] = group
...     else:
...         vid_link = tr.find('a', vid=True)
...         if vid_link is not None:
...             group.append(vid_link['vid'])
... 
>>> print(groups)
{'001': ['0099882', '0099883', '0099883'], '002': ['0099883'], '003': ['0099883', '0099883'], '004': []}

4 Comments

Thanks a lot for this, it works well. One question: I have some other trs and tables above it with additional information, and as a result I am getting the error "AttributeError: 'NoneType' object has no attribute 'append'". Is there some way to make it look at these trs specifically, i.e. items with id=00..? Or is it already doing that? When I delete all the trs / tables above it, it works fine.
@user1982011: that means you are seeing <a vid="..."> links before the first <span class="apple-font" id="..."> row is showing up. Rather than set groups = {} and group = None you can use group = [], groups = {'no id': group} and collect all those in a separate list keyed on 'no id'.
Thank you, this works perfectly! One last question: if there are multiple <a vid="#####"></a> within a td, is there a way to pick them all up and include them in the respective group? It was an edge case that came up.
@user1982011: use group.extend(link['vid'] for link in tr.find_all('a', vid=True)) to add them all; that's the whole else block.
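
Putting the two suggestions from the comments together, the loop might look roughly like the sketch below. The sample HTML, the function name group_vids, and the choice of 'html.parser' are assumptions made for illustration; only the 'no id' bucket and the group.extend(...) call come from the comments above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one vid link appears before the first id row,
# and one row holds two vid links in the same td.
html = """
<table>
<tr><td><a vid="0000001"></a></td></tr>
<tr><td><div><span class="apple-font" id="001"></span></div></td></tr>
<tr></tr>
<tr><td><a vid="0099882"></a><a vid="0099883"></a></td></tr>
<tr><td><div><span class="apple-font" id="002"></span></div></td></tr>
<tr><td><a vid="0099884"></a></td></tr>
</table>
"""


def group_vids(table):
    """Map each span id to the list of vids in the rows that follow it."""
    # Links seen before the first id row are collected under 'no id',
    # so group is never None when we add to it.
    group = []
    groups = {'no id': group}
    for tr in table.find_all('tr'):
        id_span = tr.find(class_='apple-font', id=True)
        if id_span is not None:
            # A row with an id span starts a new group.
            group = []
            groups[id_span['id']] = group
        else:
            # Add every vid link in the row, not just the first one.
            group.extend(link['vid'] for link in tr.find_all('a', vid=True))
    return groups


soup = BeautifulSoup(html, 'html.parser')
table = soup.find(class_='apple-font', id=True).find_parent('table')
groups = group_vids(table)
print(groups)
```

With this sample, the link that precedes the first id row ends up under 'no id' instead of raising the AttributeError, and both links in the shared td land in the group for '001'.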
