
I am trying to extract a table from a web page. Below are the HTML and the Python code, which uses BeautifulSoup. This code has always worked for me, but in this case I get blank output. Thanks in advance.

<table>
<thead>
<tr>
<th>Period Ending:</th>
<th class="TalignL">Trend</th>
<th>9/27/2014</th>
<th>9/28/2013</th>
<th>9/29/2012</th>
<th>9/24/2011</th>
</tr>
</thead>
<tr>
<th bgcolor="#E6E6E6">Total Revenue</th>
<td class="td_genTable"><table border="0" align="center" width="*" cellspacing="0" cellpadding="0"><tr><td align="bottom"><table border="0" height="100%" cellspacing="0" cellpadding="0"><tr><td><table cellspacing="0" cellpadding="0" border="0"><tr><td height="15" bgcolor="#47C3D3" width="6"></td><td height="15" bgcolor="#FFFFFF" width="1px"></td></tr><tr><td height="1" colspan="2" bgcolor="#D1D1D1"></td></tr></table></td><td><table cellspacing="0" cellpadding="0" border="0"><tr><td height="1" bgcolor="#FFFFFF" width="6"></td><td height="1" bgcolor="#FFFFFF" width="1px"></td></tr><tr><td height="14" bgcolor="#47C3D3" width="6"></td><td height="14" bgcolor="#FFFFFF" width="1px"></td></tr><tr><td height="1" colspan="2" bgcolor="#D1D1D1"></td></tr></table></td><td><table cellspacing="0" cellpadding="0" border="0"><tr><td height="2" bgcolor="#FFFFFF" width="6"></td><td height="2" bgcolor="#FFFFFF" width="1px"></td></tr><tr><td height="13" bgcolor="#47C3D3" width="6"></td><td height="13" bgcolor="#FFFFFF" width="1px"></td></tr><tr><td height="1" colspan="2" bgcolor="#D1D1D1"></td></tr></table></td><td><table cellspacing="0" cellpadding="0" border="0"><tr><td height="7" bgcolor="#FFFFFF" width="6"></td><td height="7" bgcolor="#FFFFFF" width="1px"></td></tr><tr><td height="8" bgcolor="#47C3D3" width="6"></td><td height="8" bgcolor="#FFFFFF" width="1px"></td></tr><tr><td height="1" colspan="1" bgcolor="#D1D1D1"></td></tr></table></td></tr></table></td></tr></table></td>
<td>$182,795,000</td>
<td>$170,910,000</td>
<td>$156,508,000</td>
<td>$108,249,000</td>
    rows = table.findAll('tr')

    for row in rows:
        cols = row.findAll('td')
        col1 = [ele.text.strip().replace(',','') for ele in cols]

        account = col1[0:1]
        period1 = col1[2:3]
        period2 = col1[3:4]
        period3 = col1[4:5]

        record = (stock, account,period1,period3,period3)
        print record
  • Your first column of your first non-header row contains a table full of empty cells with no text in them. Your code is correctly finding that no text. I'm not sure what you wanted it to do instead. Commented May 10, 2015 at 17:29
  • Meanwhile, why are you using the deprecated name findAll? Are you learning from sample code written for BS3 instead of from updated samples or documentation for BS4? Commented May 10, 2015 at 17:34
  • Finally, find_all (or findAll) searches through all descendants, not just the top-level children. So, unless you want to iterate through both the rows of the outer table and the rows of every subtable embedded inside a column of that table and treat them the same, you shouldn't be using it here. Commented May 10, 2015 at 17:35

2 Answers


Adding to what @abarnert pointed out, I would get all the td elements whose text starts with $:

for row in soup.table.find_all('tr', recursive=False):
    record = [td.text.replace(",", "") for td in row.find_all("td", text=lambda x: x and x.startswith("$"))]
    print record

For the input you've provided, it prints:

[u'$182795000', u'$170910000', u'$156508000', u'$108249000']

which you can "unpack" into separate variables:

account, period1, period2, period3 = record

Note that I'm explicitly passing recursive=False to avoid going deeper in the tree and get only direct tr children of the table element.
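
As a rough Python 3 sketch of the same idea, the filtering can also be done in plain Python after a non-recursive search. The HTML here is a cut-down stand-in for the markup in the question, and converting the amounts to int is my own assumption about what is wanted, not something the question asks for:

```python
from bs4 import BeautifulSoup

# Cut-down stand-in for the question's markup (assumption: same overall shape).
html = """
<table>
<tr>
<th>Total Revenue</th>
<td><table><tr><td></td><td></td></tr></table></td>
<td>$182,795,000</td>
<td>$170,910,000</td>
<td>$156,508,000</td>
<td>$108,249,000</td>
</tr>
</table>
"""

# html.parser does not insert a <tbody>, so the rows stay direct children of <table>.
soup = BeautifulSoup(html, "html.parser")

records = []
for row in soup.table.find_all("tr", recursive=False):
    amounts = []
    # Only look at the row's own cells, not the cells of the nested trend table.
    for td in row.find_all("td", recursive=False):
        text = td.get_text(strip=True)
        if text.startswith("$"):
            # Drop the currency sign and thousands separators before converting.
            amounts.append(int(text.lstrip("$").replace(",", "")))
    records.append(amounts)

print(records)
```

Filtering on the cell text after the search avoids depending on how a particular Beautiful Soup version names its text-matching argument.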



Your first problem is that find_all (or findAll, which is just a deprecated synonym for the same thing) doesn't just find the rows in the table, it finds the rows in the table and in every subtable within it. You almost certainly don't want to iterate over both kinds of rows and run the same code on each one. If you don't want that, as the recursive argument docs say, pass recursive=False.

So, now you get back only one row. If you do row.find_all('td'), that's going to have the same problem again—you're going to find all of the columns of this row, and all of the columns of every row in every subtable within one of those columns. Again, that's not what you want, so use recursive=False.

And now you get back only 5 columns. The first one is just a big table with a bunch of empty cells in it; the others, on the other hand, have dollar values in them, which seem to be the ones you want.
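
To make the difference concrete, here is a small Python 3 sketch that counts what each kind of search returns. The HTML is a shortened stand-in for the table in the question (two dollar columns instead of four, and a much smaller nested table), and it assumes the html.parser backend, which does not insert a tbody element:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's table: a header row, then one data row
# whose first cell holds a nested table of empty cells.
html = """
<table>
  <thead><tr><th>Period Ending:</th><th>Trend</th><th>9/27/2014</th><th>9/28/2013</th></tr></thead>
  <tr>
    <th>Total Revenue</th>
    <td><table><tr><td></td><td></td></tr><tr><td></td></tr></table></td>
    <td>$182,795,000</td>
    <td>$170,910,000</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.table

# Default search: every <tr> descendant, including header and nested-table rows.
all_rows = table.find_all("tr")
# Non-recursive search: only <tr> elements that are direct children of <table>.
direct_rows = table.find_all("tr", recursive=False)

row = direct_rows[0]
all_cells = row.find_all("td")                      # includes the nested table's cells
direct_cells = row.find_all("td", recursive=False)  # only this row's own cells
texts = [td.get_text(strip=True) for td in direct_cells]

print(len(all_rows), len(direct_rows))
print(len(all_cells), len(direct_cells))
print(texts)
```

With this sample, the recursive searches pick up the extra rows and cells from the nested table, while the non-recursive ones return just the one data row and its three own cells, the first of which has no text.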


So, just adding recursive=False to both calls, and setting stock to something (I don't know where it's supposed to come from in your code, but without it you're obviously going to just get a NameError):

stock = 'spam'

rows = table.find_all('tr', recursive=False)

for row in rows:
    cols = row.find_all('td', recursive=False)
    col1 = [ele.text.strip().replace(',','') for ele in cols]

    account = col1[0:1]
    period1 = col1[2:3]
    period2 = col1[3:4]
    period3 = col1[4:5]

    record = (stock, account,period1,period3,period3)

    print record

This will print:

('spam', [''], ['$170910000'], ['$108249000'], ['$108249000'])

I'm not sure why you used period3 twice and never used period2, why you skipped over column 1 entirely, or why you sliced 1-element lists instead of just indexing the values, but anyway, this seems to be what you were trying to do.


As a side note, if you actually want to break out the list into 5 values, rather than into 4 1-element lists skipping one of the values, you can write:

account, whatever, period1, period2, period3 = col1
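
A quick sketch of the difference between slicing, indexing, and unpacking, using a hard-coded list that stands in for what the corrected loop produces for the sample row (the list contents are taken from the question's data; nothing here touches Beautiful Soup):

```python
# Stand-in for col1 after the corrected loop runs on the sample row.
col1 = ["", "$182795000", "$170910000", "$156508000", "$108249000"]

sliced = col1[2:3]   # slicing always returns a list, here with one element
indexed = col1[2]    # indexing returns the element itself

# Unpacking assigns each element to its own name; it raises ValueError
# if the number of names does not match the number of elements.
account, whatever, period1, period2, period3 = col1

print(sliced, indexed)
print(account, whatever, period1, period2, period3)
```

Unpacking only works when the row has exactly five cells, so a row with a different layout would need its own handling.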

2 Comments

@alecxe: To write books, you have to be able to edit things down. My chapter 1 would be 1300 pages. (That might work for a novel, but Tristram Shandy was already written 250 years ago…)
This worked!! Thank you. You are right. I am using stock as variable to loop through multiple stocks. Thanks again.
