I have the following HTML code I am trying to scrape from a website:

<td>Net Taxes Due<td>
<td class="value-column">$2,370.00</td>
<td class="value-column">$2,408.00</td>

What I am trying to accomplish is to search the page to find the text "Net Taxes Due" within the tag, find the siblings of the tag, and send the results into a Pandas data frame.

I have the following code:

soup = BeautifulSoup(url, "html.parser")
table = soup.select('#Net Taxes Due')

cells = table.find_next_siblings('td')
cells = [ele.text.strip() for ele in cells]

df = pd.DataFrame(np.array(cells))

print(df)

I've been all over the web looking for a solution and can't come up with anything. I'd appreciate any help.

Thanks!

5 Answers

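In the following I expected to use indices 1 and 2, but 2 and 3 seem to work when using lxml.html and XPath. The likely reason is that the first cell in the sample ends with <td> rather than </td>, so the parser creates an extra, empty td between the label and the values.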

import requests
from lxml.html import fromstring
# url = ''
# tree = fromstring(requests.get(url).content)
h = '''
<td>Net Taxes Due<td>
<td class="value-column">$2,370.00</td>
<td class="value-column">$2,408.00</td>

'''
tree = fromstring(h)
links = [link.text for link in tree.xpath('//td[text() = "Net Taxes Due"]/following-sibling::td[2] | //td[text() = "Net Taxes Due"]/following-sibling::td[3]' )]
print(links)

Make sure to add the tag name along with your search string. This is how you can do that:

from bs4 import BeautifulSoup

htmldoc = """
<tr>
    <td>Net Taxes Due</td>
    <td class="value-column">$2,370.00</td>
    <td class="value-column">$2,408.00</td>
</tr>
"""    
soup = BeautifulSoup(htmldoc, "html.parser")
item = soup.find('td',text='Net Taxes Due').find_next_sibling("td")
print(item)

3 Comments

I get the following error: 'NoneType' object has no attribute 'find_next_sibling'
Then something must be going wrong on your end. As a proof of concept, try executing the snippet above.
Thanks! Your solution works perfectly. Turns out I'm an idiot and I had an error in the URL I was pulling from. No wonder it wasn't returning any values...

Your .select() call is not correct. # in a selector is used to match an element's ID, not its text contents, so #Net means to look for an element with id="Net". Spaces in a selector mean to look for descendants that match each successive selector. So #Net Taxes Due searches for something like:

<div id="Net">
    <taxes>
        <due>...</due>
    </taxes>
</div>
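
A minimal sketch of that selector behavior, assuming bs4 4.7.0 or later (where select() is backed by the soupsieve package) and the built-in html.parser. The markup strings and variable names are illustrative only:

```python
from bs4 import BeautifulSoup

# Hypothetical markup that the selector '#Net Taxes Due' would actually match.
nested = '<div id="Net"><taxes><due>found</due></taxes></div>'
# Markup shaped like the row in the question.
row = '<tr><td>Net Taxes Due</td><td class="value-column">$2,370.00</td></tr>'

selector = '#Net Taxes Due'

nested_matches = BeautifulSoup(nested, 'html.parser').select(selector)
row_matches = BeautifulSoup(row, 'html.parser').select(selector)

print(nested_matches)  # the <due> element nested under id="Net"
print(row_matches)     # nothing: no element in the row has id="Net"
```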

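To search for an element containing a specific string, use .find() with the tag name and the string keyword. With string alone, find() returns the text node rather than the enclosing <td>, and the text node has no <td> siblings:

table = soup.find('td', string="Net Taxes Due")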
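
A small sketch of how the string keyword behaves with and without a tag name, using illustrative variable names and the question's row wrapped in a <tr>. It suggests that sibling lookups should start from the <td> element, not from the text inside it:

```python
from bs4 import BeautifulSoup

html = """
<tr>
    <td>Net Taxes Due</td>
    <td class="value-column">$2,370.00</td>
    <td class="value-column">$2,408.00</td>
</tr>
"""
soup = BeautifulSoup(html, "html.parser")

# string alone: the result is the text node inside the first <td>
text_node = soup.find(string="Net Taxes Due")
print(type(text_node).__name__)            # NavigableString
print(text_node.find_next_siblings("td"))  # empty: the text node has no siblings

# tag name plus string: the result is the <td> element itself
label_cell = soup.find("td", string="Net Taxes Due")
values = [td.get_text(strip=True) for td in label_cell.find_next_siblings("td")]
print(values)
```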

3 Comments

Thanks. I ran that using the string; however, I get an error when I try to find the siblings of that tag. cells = table.find_next_siblings('td') raises AttributeError: 'NoneType' object has no attribute 'find_next_siblings'
That means it's not finding the Net Taxes Due element, so find() returns None.
Was this before or after you fixed the URL?

Assuming that there's an actual HTML table involved:

<html>
<table>
<tr>
<td>Net Taxes Due</td>
<td class="value-column">$2,370.00</td>
<td class="value-column">$2,408.00</td>
</tr>
</table>
</html>

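from bs4 import BeautifulSoup
# html is the markup above as a string (the page source, not its URL)
soup = BeautifulSoup(html, "html.parser")
table = soup.find('tr')
# text of every value cell in the row
df = [x.text for x in table.findAll('td', {'class':'value-column'})]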

1 Comment

Look up list comprehensions. In this case x represents each individual tag found by the findAll method.

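These should work. If you are using bs4 4.7.0 or later, you can use select with the :contains() pseudo-class (newer releases of the underlying soupsieve package deprecate it in favor of :-soup-contains()). If you are on an older version, or just prefer the find interface, you can use that instead. As stated earlier, you cannot match text content with #; it matches an ID.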

import bs4

markup = """
<td>Net Taxes Due</td>
<td class="value-column">$2,370.00</td>
<td class="value-column">$2,408.00</td>
"""

# Version 4.7.0 or later
soup = bs4.BeautifulSoup(markup, "html.parser")
cells = soup.select('td:contains("Net Taxes Due") ~ td.value-column')
cells = [ele.text.strip() for ele in cells]
print(cells)

# Version < 4.7.0 or if you prefer find
soup = bs4.BeautifulSoup(markup, "html.parser")
cells = soup.find('td', text="Net Taxes Due").find_next_siblings('td')
cells = [ele.text.strip() for ele in cells]
print(cells)

You would get this

['$2,370.00', '$2,408.00']
['$2,370.00', '$2,408.00']
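
Since the question asks for the result in a Pandas data frame, here is a minimal sketch of that last step. It assumes the page HTML is already available as a string (for example from requests.get(url).text, not the URL itself). The column names and the currency clean-up are illustrative choices, not something taken from the question:

```python
import bs4
import pandas as pd

markup = """
<tr>
    <td>Net Taxes Due</td>
    <td class="value-column">$2,370.00</td>
    <td class="value-column">$2,408.00</td>
</tr>
"""

soup = bs4.BeautifulSoup(markup, "html.parser")
label_cell = soup.find("td", string="Net Taxes Due")

# find() returns None when the label is missing (e.g. the wrong page was fetched)
if label_cell is None:
    raise ValueError('no <td> with the text "Net Taxes Due" was found')

cells = [td.get_text(strip=True) for td in label_cell.find_next_siblings("td")]

# Hypothetical column names; "amount" strips "$" and "," to get a number
df = pd.DataFrame({"raw": cells})
df["amount"] = df["raw"].str.replace("[$,]", "", regex=True).astype(float)
print(df)
```

Checking for None before calling find_next_siblings avoids the 'NoneType' AttributeError reported in the comments when the label is not on the page.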
