Find Specific Text Within HTML Tag in Python

Question

I've tried a million different ways to parse out the zestimate, but have yet to be successful.

Here's the HTML containing the Zestimate info:

<span>
  <span tabindex="0" role="button">
    <span class="sc-bGbJRg iiEDXU ds-dashed-underline">
      Zestimate
    <sup>®</sup>
    </span>
  </span>
  :&nbsp;
  <span>$331,425</span>
</span>

Honestly I thought this would get me close, but I get an empty list:

link = 'https://www.zillow.com/homedetails/1404-Clearwing-Cir-Georgetown-TX-78626/121721750_zpid/'
searched_word = '<span class="sc-bGbJRg iiEDXU ds-dashed-underline">Zestimate<sup>®</sup></span>'
test_page = requests.Session().get(link, headers=req_headers)
test_soup = BeautifulSoup(test_page.content, 'lxml')
results = test_soup('span',string='searched_word')
print(results)[0]

Andrej Kesely · Accepted Answer · 2020-06-13 23:50:34Z


To get the correct HTML from the site, add a User-Agent header to the request.

For example:

import requests
from bs4 import BeautifulSoup


url = 'https://www.zillow.com/homedetails/1404-Clearwing-Cir-Georgetown-TX-78626/121721750_zpid/'
# Send a browser-like User-Agent; without it the site may return a captcha page instead of the listing.
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

# Find the <h4> whose text contains "Home value", then read the first <p> that follows it.
home_value = soup.select_one('h4:contains("Home value")').find_next('p').get_text(strip=True)
print(home_value)

Prints:

$331,425
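
Separately from the headers, the lookup in the question returns an empty list for two reasons. First, string='searched_word' passes the literal text searched_word rather than the variable. Second, string= is compared against a tag's .string, which is None for a span that holds both text and a <sup> child, so even the variable would not match. As a minimal sketch, assuming the markup matches the snippet in the question (the helper name extract_zestimate and the inline html sample are illustrative and not part of the original answer), one could locate the text node containing "Zestimate" and read the next <span> after it:

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; the live page may differ.
html = """
<span>
  <span tabindex="0" role="button">
    <span class="sc-bGbJRg iiEDXU ds-dashed-underline">
      Zestimate
    <sup>®</sup>
    </span>
  </span>
  :&nbsp;
  <span>$331,425</span>
</span>
"""


def extract_zestimate(markup):
    """Return the text of the first <span> after the "Zestimate" label, or None."""
    soup = BeautifulSoup(markup, "html.parser")

    # Match the text node itself, so surrounding whitespace and the
    # sibling <sup> element do not matter.
    label = soup.find(string=lambda s: s and "Zestimate" in s)
    if label is None:
        # The label is missing, e.g. a captcha page was returned.
        return None

    # find_next walks forward in document order from the label.
    value = label.find_next("span")
    return value.get_text(strip=True) if value is not None else None


print(extract_zestimate(html))
```

Returning None when the label is missing, instead of chaining .find_next() directly, avoids an AttributeError on a NoneType when the response is not the expected listing page.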


6 Comments

max Over a year ago

I keep getting "AttributeError: 'NoneType' object has no attribute 'find_next'"

Andrej Kesely Over a year ago

@max Try to do print(soup) and verify that you don't get captcha page

eNc Over a year ago

@max Sounds like you may be getting captcha. Modify @AndrejKesely code so that

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0', 'content-type': 'text/html; charset=UTF-8'}

and see if that works for you.

max Over a year ago

@AndrejKesely Ah dang, I didn't take the time to read the contents of that soup variable. It was a captcha page. I added a different header and it worked: req_headers = { 'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8', 'accept-encoding': 'gzip, deflate, br', 'accept-language': 'en-US,en;q=0.8', 'upgrade-insecure-requests': '1', 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36' }

max Over a year ago

@AndrejKesely Thanks so much. you have no idea how much time i spent trying to figure out both of those issues. i got nonetype all the time and didn't realize why. at this point i'm over this whole thing. But anyways, thanks again.
