Find data within HTML tags using Python

Question

I have the following HTML code I am trying to scrape from a website:

<td>Net Taxes Due<td>
<td class="value-column">$2,370.00</td>
<td class="value-column">$2,408.00</td>

I want to search the page for the tag containing the text "Net Taxes Due", find that tag's siblings, and load the results into a pandas DataFrame.

I have the following code:

soup = BeautifulSoup(url, "html.parser")
table = soup.select('#Net Taxes Due')

cells = table.find_next_siblings('td')
cells = [ele.text.strip() for ele in cells]

df = pd.DataFrame(np.array(cells))

print(df)

I've been all over the web looking for a solution and can't come up with anything. I'd appreciate any help.

Thanks!

QHarr · Accepted Answer · 2019-01-04 20:45:13Z

1

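In the following I expected to use indices 1 and 2, but 2 and 3 are what work when using lxml.html and XPath. This is probably because the sample markup ends the first cell with <td> rather than </td>, so the parser sees an extra, empty cell right after the label.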

import requests
from lxml.html import fromstring
# To parse a live page instead of the sample string:
# url = ''
# tree = fromstring(requests.get(url).content)
h = '''
<td>Net Taxes Due<td>
<td class="value-column">$2,370.00</td>
<td class="value-column">$2,408.00</td>

'''
tree = fromstring(h)
links = [link.text for link in tree.xpath('//td[text() = "Net Taxes Due"]/following-sibling::td[2] | //td[text() = "Net Taxes Due"]/following-sibling::td[3]' )]
print(links)

SIM · Accepted Answer · 2019-01-04 20:49:54Z

0

Make sure to add the tag name along with your search string. This is how you can do that:

from bs4 import BeautifulSoup

htmldoc = """
<tr>
    <td>Net Taxes Due</td>
    <td class="value-column">$2,370.00</td>
    <td class="value-column">$2,408.00</td>
</tr>
"""    
soup = BeautifulSoup(htmldoc, "html.parser")
item = soup.find('td',text='Net Taxes Due').find_next_sibling("td")
print(item)

3 Comments

Smockrun Over a year ago

I get the following error: 'NoneType' object has no attribute 'find_next_sibling'

SIM Over a year ago

Then something else in your code must be going wrong. As a proof of concept, try running the snippet above on its own.

Smockrun Over a year ago

Thanks! Your solution works perfectly. Turns out I'm an idiot and I had an error in the URL I was pulling from. No wonder it wasn't returning any values...

Barmar · Accepted Answer · 2019-01-04 20:27:55Z

0

Your .select() call is not correct. # in a selector is used to match an element's ID, not its text contents, so #Net means to look for an element with id="Net". Spaces in a selector mean to look for descendants that match each successive selector. So #Net Taxes Due searches for something like:

<div id="Net">
    <taxes>
        <due>...</due>
    </taxes>
</div>
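
To see this rule in action, here is a small sketch. The sample markup and variable names are made up for illustration, and it assumes bs4 4.7.0 or later, where select() is backed by Soup Sieve. The selector from the question matches the nested <due> element and ignores the table cell that merely contains the text:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one structure the selector matches, one table row it does not.
sample = """
<div id="Net"><taxes><due>nested element</due></taxes></div>
<table><tr><td>Net Taxes Due</td><td class="value-column">$2,370.00</td></tr></table>
"""

soup = BeautifulSoup(sample, "html.parser")

# "#Net Taxes Due" = a <due> inside a <taxes> inside the element with id="Net".
matches = soup.select("#Net Taxes Due")
matched_names = [tag.name for tag in matches]
matched_text = [tag.get_text() for tag in matches]

print(matched_names)
print(matched_text)
```

Note also that select() returns a list of elements, so a method such as find_next_siblings() has to be called on an individual element rather than on the result of select() itself.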

To search for an element containing a specific string, use .find() with the string keyword:

table = soup.find(string="Net Taxes Due")
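
One caveat, sketched below under the assumption that the label sits alone in its own <td>: find(string=...) returns the matching text node (a NavigableString), not the enclosing tag, so a sibling search has to start from the parent cell. The markup and variable names here are illustrative.

```python
from bs4 import BeautifulSoup

html_doc = """
<table><tr>
<td>Net Taxes Due</td>
<td class="value-column">$2,370.00</td>
<td class="value-column">$2,408.00</td>
</tr></table>
"""

soup = BeautifulSoup(html_doc, "html.parser")

label = soup.find(string="Net Taxes Due")  # the text node inside the first <td>
label_cell = label.find_parent("td")       # step up to the <td> element itself

# The text node has no <td> siblings of its own; its parent cell does.
from_string = label.find_next_siblings("td")
from_cell = [td.get_text(strip=True) for td in label_cell.find_next_siblings("td")]

print(from_string)
print(from_cell)
```

Passing the tag name as well, as in soup.find("td", string="Net Taxes Due"), returns the cell directly and makes the find_parent() step unnecessary.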

3 Comments

Smockrun Over a year ago

Thanks. I ran that using the string; however, I get an error when I try to find the siblings of that tag. cells = table.find_next_siblings('td') raises AttributeError: 'NoneType' object has no attribute 'find_next_siblings'

Barmar Over a year ago

That means it's not finding the Net Taxes Due element, so find() returns None.

Barmar Over a year ago

Was this before or after you fixed the URL?
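
The AttributeError discussed in these comments can be avoided by checking the result of find() before chaining another call onto it. A minimal sketch follows; the helper name sibling_values and both sample documents are hypothetical, and the second document stands in for whatever a wrong URL might return.

```python
from bs4 import BeautifulSoup


def sibling_values(html_doc, label):
    """Return the text of the <td> cells that follow the cell whose text is `label`."""
    soup = BeautifulSoup(html_doc, "html.parser")
    label_cell = soup.find("td", string=label)
    if label_cell is None:
        # find() found nothing: wrong page, changed wording, or extra whitespace.
        return []
    return [td.get_text(strip=True) for td in label_cell.find_next_siblings("td")]


good_page = """
<table><tr>
<td>Net Taxes Due</td>
<td class="value-column">$2,370.00</td>
<td class="value-column">$2,408.00</td>
</tr></table>
"""
error_page = "<html><body><h1>404 Not Found</h1></body></html>"

print(sibling_values(good_page, "Net Taxes Due"))
print(sibling_values(error_page, "Net Taxes Due"))
```

Returning an empty list is only one possible design; raising an exception with a clear message would surface a bad URL sooner.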

Chris · Accepted Answer · 2019-01-04 20:38:08Z

0

Assuming that there's an actual HTML table involved:

<html>
<table>
<tr>
<td>Net Taxes Due</td>
<td class="value-column">$2,370.00</td>
<td class="value-column">$2,408.00</td>
</tr>
</table>
</html>

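from bs4 import BeautifulSoup

# html_doc must hold the page markup (for example requests.get(url).text), not the URL itself
soup = BeautifulSoup(html_doc, "html.parser")
table = soup.find('tr')
values = [x.text for x in table.findAll('td', {'class':'value-column'})]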

1 Comment

Chris Over a year ago

Look up list comprehensions. In this case x represents each individual tag found by the findAll method.

facelessuser · Accepted Answer · 2019-01-04 21:11:04Z

These should work. If you are using bs4 4.7.0 or later, you can use select. If you are on an older version, or just prefer the find interface, you can use that instead. As stated earlier, you cannot reference content with #; that syntax matches an ID.

import bs4

markup = """
<td>Net Taxes Due</td>
<td class="value-column">$2,370.00</td>
<td class="value-column">$2,408.00</td>
"""

# Version 4.7.0 or later
soup = bs4.BeautifulSoup(markup, "html.parser")
cells = soup.select('td:contains("Net Taxes Due") ~ td.value-column')
cells = [ele.text.strip() for ele in cells]
print(cells)

# Version < 4.7.0 or if you prefer find
soup = bs4.BeautifulSoup(markup, "html.parser")
cells = soup.find('td', text="Net Taxes Due").find_next_siblings('td')
cells = [ele.text.strip() for ele in cells]
print(cells)

Both versions print the same result:

['$2,370.00', '$2,408.00']
['$2,370.00', '$2,408.00']
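
Neither snippet takes the last step the question asks for, loading the values into a pandas DataFrame. One way to do that is sketched here; the column names and the currency clean-up are assumptions for illustration and are not part of the original answer.

```python
import bs4
import pandas as pd

markup = """
<table><tr>
<td>Net Taxes Due</td>
<td class="value-column">$2,370.00</td>
<td class="value-column">$2,408.00</td>
</tr></table>
"""

soup = bs4.BeautifulSoup(markup, "html.parser")
label_cell = soup.find("td", string="Net Taxes Due")
cells = [td.get_text(strip=True) for td in label_cell.find_next_siblings("td")]

# One row per value; "label" and "amount" are illustrative column names.
df = pd.DataFrame({"label": "Net Taxes Due", "amount": cells})

# Strip "$" and thousands separators so the amounts become numbers.
df["amount"] = df["amount"].str.replace(r"[$,]", "", regex=True).astype(float)

print(df)
```

Converting the amounts to floats is optional; keep the original strings if the formatted values are what you need downstream.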
