Parsing rows of information from HTML using Python

Question

I wanted to find a way to parse information from:

<tr>
   <td class="prodSpecAtribute">Rulebook Chapter</td>
   <td colspan="5">
     <a href="http://cmegroup.com/rulebook/CME/V/450/452/452.pdf" target="_blank" title="CME Chapter 452">CME Chapter 452</a>
   </td>
</tr>

<tr>
   <td class="prodSpecAtribute" rowspan="2">
      Trading Hours
      <br>
      (All times listed are Central Time)
   </td>
   <td>OPEN OUTCRY</td>
   <td colspan="4">
      <div class="font_black Large_div_td">MON-FRI: 7:20 a.m. - 2:00 p.m.</div>
   </td>
</tr>
<tr>
   <td>CME GLOBEX</td>  <!-- PROBLEM HERE: want this cell and the div below treated as one row under the "Trading Hours" cell, i.e. <td class="prodSpecAtribute" rowspan="2"> -->
   <td colspan="4">
      <div class="font_black Large_div_td">SUN - FRI: 5:00 p.m. - 4:00 p.m. CT</div>
   </td>
</tr>

I was able to parse information in the top table easily as follows:

from bs4 import BeautifulSoup

soup = BeautifulSoup(page)  # page holds the downloaded HTML
left_col = soup.findAll('td', attrs={'class': 'prodSpecAtribute'})
right_col = soup.findAll('td', colspan=['4', '5'])

So in this example there are three rows. Two of them have a cell with class "prodSpecAtribute" and at least one data column corresponding to that class. The last row, however, has no such cell, so I need a way to reuse the last class seen and file this row under it, together with both of the third row's <td>s: CME GLOBEX and SUN - FRI: 5:00 p.m. - 4:00 p.m. CT

Combine_column method:

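def combine_col(right):
    # Collect the visible text of every cell in `right`.
    texts = []
    for cell in right:
        text_ = ' '.join(cell.findAll(text=True))
        print(text_)
        texts.append(text_)

    # Return the text of all cells, not only the last one.
    return ' '.join(texts)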

abarnert · Accepted Answer · 2013-05-22 01:09:30Z

The obvious way to merge the second and third columns of the second row is to iterate over the rows explicitly. Anything you write with find_all is just going to return row0-col1, row1-col1, and row1-col2 as three separate values, and you'll have no way of knowing which ones go together.

So, if I understand your problem, you want something like this:

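left_col = []
right_col = []
for tr in soup.find_all('tr'):
    tds = tr.find_all('td')
    # Column 0 is the label; everything after it is data.
    left, right = tds[0], tds[1:]
    # Sanity check: the label cell should carry the expected class.
    assert 'prodSpecAtribute' in left['class']
    left_col.append(left)
    right_col.append(combine_columns(right))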
Except that you need to write that combine_columns code, because I don't know how you want to "combine the information" in the columns.
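As a minimal sketch of what such a helper might look like, assuming "combine" simply means joining the visible text of each cell (only one possible reading of the question), something like the following could work. The " | " separator and the trimmed-down sample HTML are arbitrary choices made for illustration.

```python
from bs4 import BeautifulSoup


def combine_columns(cells):
    """Join the visible text of each <td> in `cells` into one string.

    Assumption: "combining" means concatenating the cell texts.
    """
    # get_text(" ", strip=True) strips each text fragment and joins the
    # non-empty ones with a single space.
    texts = [cell.get_text(" ", strip=True) for cell in cells]
    # The " | " separator is an arbitrary choice for this sketch.
    return " | ".join(t for t in texts if t)


# Trimmed-down sample based on the second row in the question.
html = """
<table>
<tr>
  <td class="prodSpecAtribute" rowspan="2">Trading Hours</td>
  <td>OPEN OUTCRY</td>
  <td colspan="4"><div>MON-FRI: 7:20 a.m. - 2:00 p.m.</div></td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
tds = soup.find("tr").find_all("td")
combined = combine_columns(tds[1:])
print(combined)
```

If you would rather keep the cells separate, returning the `texts` list instead of the joined string is an equally reasonable choice.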

I'm obviously using the rule that column 0 goes in the left, rather than whatever column has class prodSpecAtribute. I did this mainly because I can't figure out what you'd want to happen for a row that had no such column, or where it wasn't the leftmost column. So, I just added an assert for sanity checking, to verify that this is always the right rule for your source.
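If some rows do turn out to have no such column (for example, a row that sits under a rowspan="2" cell from the row above), one possible rule is to let the row inherit the most recent label. The sketch below assumes that rule. The helper name rows_with_labels is hypothetical, the sample HTML is a simplified copy of the question's rows, and the code does not read the actual rowspan value; it just reuses the last label until a new one appears.

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr>
  <td class="prodSpecAtribute">Rulebook Chapter</td>
  <td colspan="5"><a href="http://cmegroup.com/rulebook/CME/V/450/452/452.pdf">CME Chapter 452</a></td>
</tr>
<tr>
  <td class="prodSpecAtribute" rowspan="2">Trading Hours</td>
  <td>OPEN OUTCRY</td>
  <td colspan="4"><div>MON-FRI: 7:20 a.m. - 2:00 p.m.</div></td>
</tr>
<tr>
  <td>CME GLOBEX</td>
  <td colspan="4"><div>SUN - FRI: 5:00 p.m. - 4:00 p.m. CT</div></td>
</tr>
</table>
"""


def rows_with_labels(soup):
    """Return (label, [cell texts]) pairs, one per <tr>.

    Hypothetical helper: a row whose first <td> lacks the
    'prodSpecAtribute' class reuses the label of the last labeled row.
    """
    rows = []
    label = None  # most recent label seen so far
    for tr in soup.find_all("tr"):
        tds = tr.find_all("td")
        if not tds:
            continue  # skip rows without data cells (e.g. header rows)
        # tag.get("class", []) avoids a KeyError on cells with no class.
        if "prodSpecAtribute" in tds[0].get("class", []):
            label = tds[0].get_text(" ", strip=True)
            values = tds[1:]
        else:
            values = tds  # every cell is data; the label is inherited
        rows.append((label, [td.get_text(" ", strip=True) for td in values]))
    return rows


soup = BeautifulSoup(html, "html.parser")
result = rows_with_labels(soup)
for label, values in result:
    print(label, values)
```

With this rule the CME GLOBEX row ends up under "Trading Hours" alongside the OPEN OUTCRY row. Whether that grouping matches the output you want is something only you can confirm.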

3 Comments

James Hallen Over a year ago

Sorry, what I meant by combine was to append the texts found in tds[0], tds[1], ...

James Hallen Over a year ago

Could you take a look at the post again? I forgot to add the last <tr>, which is the one that is difficult to figure out.

abarnert Over a year ago

You really need to give a much less vague description if you want more exact answers. Or, better, instead of describing what you want, just show the exact output you want.
