Extract table from HTML by position using Python

Question

I want to extract a specific table from an HTML document that contains multiple tables, but unfortunately the tables have no identifiers. Each table does have a title, however. I just can't seem to figure it out.

Here is an example HTML file:

<BODY>
<TABLE>
<TH>
<H3>    <BR>TABLE 1    </H3>
</TH>
<TR>
<TD>Data 1    </TD>
<TD>Data 2    </TD>
</TR>
<TR>
<TD>Data 3    </TD>
<TD>Data 4    </TD>
</TR>
<TR>
<TD>Data 5    </TD>
<TD>Data 6    </TD>
</TR>
</TABLE>

<TABLE>
<TH>
<H3>    <BR>TABLE 2    </H3>
</TH>
<TR>
<TD>Data 7    </TD>
<TD>Data 8    </TD>
</TR>
<TR>
<TD>Data 9    </TD>
<TD>Data 10    </TD>
</TR>
<TR>
<TD>Data 11    </TD>
<TD>Data 12    </TD>
</TR>
</TABLE>
</BODY>

I can use BeautifulSoup 4 to get tables by id or name, but I need just a single table that is only identifiable by position.

I know that I can get the first table with:

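from bs4 import BeautifulSoup

tmp = f.read()  # f is an already-open file object for the HTML document
soup = BeautifulSoup(tmp, 'html.parser')  # parse the markup; naming a parser avoids a warning
table = soup.find('table')  # gets the first table

But how would I get the second table?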

alecxe · Accepted Answer · 2015-03-10 20:43:17Z

2

You can rely on the table title.

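Find the title's text node by passing a function as the text argument, then walk up to its enclosing table with find_parent(). (In Beautiful Soup 4.4.0 and later the same argument is also available as string.)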

table_name = "TABLE 1" 

table = soup.find(text=lambda x: x and table_name in x).find_parent('table')
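As a rough, self-contained sketch of the same idea, the snippet below runs the title lookup against a trimmed copy of the markup from the question. The helper name find_table_by_title is hypothetical, and two details differ from the one-liner above: it uses the string argument (which assumes Beautiful Soup 4.4.0 or later) and it compares the stripped text for equality, so that "TABLE 1" would not also match a title such as "TABLE 10".

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample markup from the question.
html = """
<BODY>
<TABLE>
<TH><H3>    <BR>TABLE 1    </H3></TH>
<TR><TD>Data 1    </TD><TD>Data 2    </TD></TR>
</TABLE>
<TABLE>
<TH><H3>    <BR>TABLE 2    </H3></TH>
<TR><TD>Data 7    </TD><TD>Data 8    </TD></TR>
</TABLE>
</BODY>
"""

soup = BeautifulSoup(html, "html.parser")


def find_table_by_title(soup, title):
    """Return the <table> whose title text equals `title`, or None.

    Hypothetical helper, not part of Beautiful Soup.
    """
    # `s and ...` guards against a None value being passed to the filter.
    node = soup.find(string=lambda s: s and s.strip() == title)
    # find() returns None when nothing matches, so check before walking up.
    return node.find_parent("table") if node is not None else None


table = find_table_by_title(soup, "TABLE 2")
cells = [td.get_text(strip=True) for td in table.find_all("td")]
print(cells)  # ['Data 7', 'Data 8']
```

Returning None for a missing title, instead of letting find_parent fail on None, is a design choice for the sketch; raising an exception would be just as reasonable.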


4 Comments

jedwards Over a year ago

Just curious, why x and? Surely bool(x) would return True if table_name in x did, no? Are you just short-circuiting for performance?

alecxe Over a year ago

@jedwards It's just a function argument; you can name it whatever you want. text would probably be a better choice.

alecxe Over a year ago

@jedwards We check x because it can also be None, which would cause a TypeError without this extra check.

jedwards Over a year ago

The second comment was what I was wondering about; that makes perfect sense.
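The point of the x and guard can be illustrated in plain Python, without Beautiful Soup. This is only a sketch: the function names guarded and unguarded are made up for the illustration, and exactly when Beautiful Soup hands None to a filter function depends on how the search is set up, so treat that part as an assumption.

```python
table_name = "TABLE 1"


def unguarded(x):
    # Fails when x is None: `in` needs a container on the right-hand side.
    return table_name in x


def guarded(x):
    # `and` short-circuits, so the `in` test never runs for None (or "").
    return x and table_name in x


try:
    unguarded(None)
except TypeError as exc:
    print("unguarded raised", type(exc).__name__)

print(guarded(None))           # None, which a filter treats as "no match"
print(guarded("  TABLE 1  "))  # True
```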

exhoosier10 · Accepted Answer · 2015-03-10 21:46:45Z

0

If it's only identifiable by position, meaning it's always the second table on the page, you could do:

from bs4 import BeautifulSoup

tmp = f.read()
soup = BeautifulSoup(tmp, 'html.parser')

# find_all returns every table in document order; index 1 is the second one
all_tables = soup.find_all('table')
second_table = all_tables[1]
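For a runnable version of the positional lookup, here is a minimal sketch against a trimmed copy of the question's markup. The helper nth_table is hypothetical; it only adds a bounds check so that a page with fewer tables than expected yields None instead of an IndexError.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample markup from the question.
html = """
<BODY>
<TABLE>
<TH><H3>    <BR>TABLE 1    </H3></TH>
<TR><TD>Data 1    </TD><TD>Data 2    </TD></TR>
</TABLE>
<TABLE>
<TH><H3>    <BR>TABLE 2    </H3></TH>
<TR><TD>Data 7    </TD><TD>Data 8    </TD></TR>
</TABLE>
</BODY>
"""

soup = BeautifulSoup(html, "html.parser")


def nth_table(soup, index):
    """Return the table at zero-based `index`, or None if there are too few.

    Hypothetical helper, not part of Beautiful Soup.
    """
    tables = soup.find_all("table")  # all tables, in document order
    return tables[index] if 0 <= index < len(tables) else None


second = nth_table(soup, 1)
print(second.find("h3").get_text(strip=True))  # TABLE 2
```

One caveat: find_all searches recursively, so a table nested inside another table is counted too. If the real pages nest tables, the index of the one you want may differ from its visual position.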

edited Mar 10, 2015 at 21:46

answered Mar 10, 2015 at 20:54
