I am using Python regular expressions to parse an HTML file, and I need to extract a number from an HTML tag. The number can be either an integer or a floating-point value. Here are two examples:

integer case:

<span class='addr-bbs'>2 baths</span>

floating point case:

<span class='addr-bbs'>3.5 baths</span>

My original code is:

bath = re.findall('<span class=\"addr_bbs\">' + '(.{1,3})' + 'baths{0,1}<', str(homedata))

But after testing, it misses all the floating point cases. How can I cover both cases to extract the number correctly?

Thanks

3 Answers

1

As commented, use an HTML parser to find the tags by class name. If the number is always the first token in the text, you can simply split the text to extract it once you have the tag:

from bs4 import BeautifulSoup
h = """<span class='addr-bbs'>3.5 baths</span>
      <span class='addr-bbs'>1 baths</span>
      <span class='foos'>3.0 baths</span>"""

soup = BeautifulSoup(h,"html.parser")

for span in soup.select("span.addr-bbs"):
    print(span.text.split()[0])

Which would print:

3.5
1

If you also want to filter by the tag text, i.e. there are other spans with the addr-bbs class, you can pass a regex to find_all so that you only get the span.addr-bbs elements that contain the word baths.

from bs4 import BeautifulSoup
import re
h = """<span class='addr-bbs'>3.5 baths</span>
      <span class='addr-bbs'>5 rooms</span>
      <span class='addr-bbs'>1 baths</span>
      <span class='foos'>3.0 baths</span>"""

soup = BeautifulSoup(h, "html.parser")

# Keep only span.addr-bbs tags whose text contains the whole word "baths".
# string= replaces the older text= keyword (bs4 4.4.0 and later).
for span in soup.find_all("span", "addr-bbs", string=re.compile(r"\bbaths\b")):
    print(span.text.split()[0])

1 Comment

You are probably right, regex may not be a good idea in the long term. I need to redo the whole thing with BeautifulSoup.
0

First, realize you are somewhat doomed without more processing. Some realtors will write "2.5", others "2 1/2", others "2+1/2", and so on. MLS data has never been normalized, in part to make it difficult to parse. Just when you think you have it solved, you get "2+sink". It's usually permissible to guess the numeric meaning for searches and then spit out the original text when it's displayed.
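As a rough illustration of "guessing the numeric meaning", here is a minimal sketch. The function name parse_bath_count and the set of formats it handles are assumptions made for this example, not part of any MLS standard; real listings will contain variants it does not cover.

```python
import re

def parse_bath_count(text):
    """Best-effort conversion of a free-form bath count to a float.

    Hypothetical helper: returns None when no leading number is recognized.
    """
    text = text.strip()
    # Whole number followed by a fraction: "2 1/2", "2+1/2", "2-1/2".
    m = re.match(r"(\d+)\s*[+\- ]\s*(\d+)/(\d+)", text)
    if m:
        whole, num, den = (int(g) for g in m.groups())
        if den == 0:
            return None
        return whole + num / den
    # Plain integer or decimal: "2", "2.5". Anything after it is ignored,
    # so "2+sink" is guessed as 2.0.
    m = re.match(r"\d+(?:\.\d+)?", text)
    if m:
        return float(m.group())
    return None
```

Keeping the original string alongside the guessed value lets you search on the number and still display what the realtor actually wrote.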

You should probably grab everything from the > to baths. To do this correctly, you should use the "non-greedy" modifier, so that you don't parse all the way down to the next record. You can read about non-greedy matching in the Python docs, but the magic phrase is:

bath = re.findall(r'''<span class=["']addr-bbs["']>(.*?)bath.?<''', str(homedata))

Then try to parse each string in the returned list as best you can (re.findall returns a list of the captured strings, not a match object).
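To see why the non-greedy form matters, here is a small sketch comparing the two on a made-up, single-line sample (the html string is an assumption for illustration, not the asker's real data):

```python
import re

# Two records on one line, as they might appear in minified HTML.
html = ("<span class='addr-bbs'>2 baths</span>"
        "<span class='addr-bbs'>3.5 baths</span>")

# Greedy: (.*) runs to the last "baths<" and swallows the first record.
greedy = re.findall(r"<span class='addr-bbs'>(.*)baths?<", html)

# Non-greedy: (.*?) stops at the first "baths<" after each opening tag.
lazy = re.findall(r"<span class='addr-bbs'>(.*?)baths?<", html)

# Strip the trailing space that the capture group picks up.
values = [s.strip() for s in lazy]

print(greedy)
print(values)
```

The greedy version returns a single string spanning both records, while the non-greedy version returns one capture per span.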


0

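Three typos in the pattern:

  • the inverted commas (the HTML uses single quotes, the pattern expects double quotes);
  • the dash (the class is addr-bbs, not addr_bbs);
  • the space (with no space before baths in the pattern, .{1,3} has to absorb it, which leaves too few characters for "3.5").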

Try with bath = re.findall('''<span class=["']addr-bbs["']>''' + '(.{1,3})' + ' baths{0,1}<', str(homedata))
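If you stay with a regex, a possible refinement is to match the number explicitly instead of "any one to three characters". The sketch below is one way to do that; the names pattern and extract_baths and the sample string are assumptions for illustration, and it only covers plain integers and decimals.

```python
import re

# \d+(?:\.\d+)? matches "2" as well as "3.5"; ["'] accepts either quote style.
pattern = re.compile(
    r"""<span class=["']addr-bbs["']>\s*(\d+(?:\.\d+)?)\s*baths?<"""
)

def extract_baths(html):
    """Return every bath count found in html as a float."""
    return [float(n) for n in pattern.findall(html)]

sample = (
    "<span class='addr-bbs'>2 baths</span> "
    '<span class="addr-bbs">3.5 baths</span> '
    "<span class='addr-bbs'>1 bath</span> "
    "<span class='addr-bbs'>5 rooms</span>"
)
print(extract_baths(sample))
```

Because the capture group only accepts digits with an optional decimal part, the result can be passed straight to float without stripping whitespace first.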
