HTML parsing with python regular expression

Question

I am using Python regular expressions to parse an HTML file, and I need to extract a number from an HTML tag. The number can be either an integer or a floating-point value. Here are two examples:

integer case:

<span class='addr-bbs'>2 baths</span>

floating point case:

<span class='addr-bbs'>3.5 baths</span>

My original code is:

bath = re.findall('<span class=\"addr_bbs\">' + '(.{1,3})' + 'baths{0,1}<', str(homedata))

But after testing, it misses all the floating point cases. How can I cover both cases to extract the number correctly?

Thanks

Please don't parse HTML with regex, it's gonna hurt you. You're using Python already, why not use BeautifulSoup? crummy.com/software/BeautifulSoup/bs4/doc
– 1sloc, Commented Jul 11, 2016 at 19:50
Possible duplicate of RegEx match open tags except XHTML self-contained tags
– Two-Bit Alchemist, Commented Jul 11, 2016 at 19:51

Padraic Cunningham · Accepted Answer · 2016-07-11 20:01:49Z

As the comments suggest, use an HTML parser to find the tags by class name. If the number is always the first token in the text, you can simply split the text to extract it once you have the tag:

from bs4 import BeautifulSoup
h = """<span class='addr-bbs'>3.5 baths</span>
      <span class='addr-bbs'>1 baths</span>
      <span class='foos'>3.0 baths</span>"""

soup = BeautifulSoup(h,"html.parser")

for span in soup.select("span.addr-bbs"):
    print(span.text.split()[0])

Which would print:

3.5
1

If you also want to filter by the tag text, i.e. there are other spans with the addr-bbs class, you can pass a regex to find_all to get only the span.addr-bbs elements that contain the word baths:

from bs4 import BeautifulSoup
import re
h = """<span class='addr-bbs'>3.5 baths</span>
      <span class='addr-bbs'>5 rooms</span>
      <span class='addr-bbs'>1 baths</span>
      <span class='foos'>3.0 baths</span>"""

soup = BeautifulSoup(h, "html.parser")

for span in soup.find_all("span","addr-bbs", text=re.compile(r"\bbaths\b")):
    print(span.text.split()[0])
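
If installing BeautifulSoup is not an option, the same filter-by-class-then-by-text idea can be sketched with the standard library's html.parser module. This is only an illustration, not part of the answer above: the class name BathSpanParser and the variable names are made up here, and the sketch assumes the target spans are not nested inside each other.

```python
import re
from html.parser import HTMLParser


class BathSpanParser(HTMLParser):
    """Collect the text of every <span> whose class list contains addr-bbs.

    Hypothetical helper for illustration; assumes target spans are not nested.
    """

    def __init__(self):
        super().__init__()
        self._in_target = False
        self._buffer = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "addr-bbs" in classes:
            self._in_target = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_target:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._in_target:
            self.texts.append("".join(self._buffer).strip())
            self._in_target = False


sample = """<span class='addr-bbs'>3.5 baths</span>
<span class='addr-bbs'>5 rooms</span>
<span class='addr-bbs'>1 baths</span>
<span class='foos'>3.0 baths</span>"""

parser = BathSpanParser()
parser.feed(sample)
parser.close()

# Keep only spans that mention bath/baths, then take the leading token.
bath_counts = [t.split()[0] for t in parser.texts if re.search(r"\bbaths?\b", t)]
print(bath_counts)  # ['3.5', '1']
```

BeautifulSoup remains the more robust choice for messy real-world markup; this sketch only shows that the regex is applied to the extracted text, not to the raw HTML.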

edited Jul 11, 2016 at 20:01

answered Jul 11, 2016 at 19:55


1 Comment

DQI Over a year ago

You are probably right; regex may not be a good idea in the long term. I need to redo the whole thing with BeautifulSoup.

Charles Merriam · Accepted Answer · 2016-07-11 20:01:14Z

First, realize you are somewhat doomed without more processing. Some realtors will write "2.5", others "2 1/2", others "2+1/2", and so on. MLS data has never been normalized, in part to make it difficult to parse. Just when you think you have it solved, you get "2+sink". It's usually permissible to guess the numeric meaning for searches and then spit out the original text when it's displayed.
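
A minimal sketch of that "guess the number, keep the original text" approach might look like the following. The function name guess_bath_count and the set of formats it recognizes are assumptions made for illustration; real listing data will contain variants this does not cover.

```python
import re
from fractions import Fraction


def guess_bath_count(text):
    """Return (numeric guess or None, original text).

    Hypothetical helper: handles only "2", "2.5", "2 1/2" and "2+1/2" styles.
    """
    cleaned = text.strip()

    # Mixed number such as "2 1/2" or "2+1/2".
    mixed = re.match(r"(\d+)\s*[+ ]\s*(\d+)/(\d+)", cleaned)
    if mixed:
        whole, num, den = (int(g) for g in mixed.groups())
        if den != 0:
            return float(whole + Fraction(num, den)), text

    # Plain integer or decimal such as "2" or "2.5"; also catches "2+sink" as 2.
    plain = re.match(r"\d+(?:\.\d+)?", cleaned)
    if plain:
        return float(plain.group()), text

    # Nothing recognizable: no numeric guess, but the original text survives.
    return None, text


for raw in ["2.5", "2 1/2", "2+1/2", "2+sink", "studio"]:
    print(guess_bath_count(raw))
```

Returning the original string alongside the guess lets a search use the number while the display shows exactly what the realtor wrote.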

You should probably grab everything from the > to baths. To do this correctly, you should use the "non-greedy" modifier, so that you don't parse all the way down to the next record. You can read about non-greedy matching in the Python docs, but the magic phrase is:

bath = re.findall(r'''<span class=["']addr-bbs["']>(.*?)bath.?<''', str(homedata))

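Then parse each string in bath as best you can. Note that re.findall returns a list of the captured strings rather than a match object, so there is no bath.groups() to call.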
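
To illustrate why the non-greedy modifier matters, here is a small comparison using the markup from the question. Two records are placed on one line, since `.` does not cross newlines by default; the variable names are only illustrative.

```python
import re

# Two records on a single line, using the sample markup from the question.
html = ("<span class='addr-bbs'>2 baths</span>"
        "<span class='addr-bbs'>3.5 baths</span>")

# Greedy: (.*) runs on to the LAST "bath", swallowing the first record's tail.
greedy = re.findall(r"<span class='addr-bbs'>(.*)bath.?<", html)

# Non-greedy: (.*?) stops at the FIRST "bath" after each opening tag.
lazy = re.findall(r"<span class='addr-bbs'>(.*?)bath.?<", html)

print(greedy)  # ["2 baths</span><span class='addr-bbs'>3.5 "]
print(lazy)    # ['2 ', '3.5 ']
```

The captured strings keep their trailing space, so a strip() is still needed before any numeric conversion.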

logi-kal · Accepted Answer · 2016-07-11 20:04:38Z

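Your pattern has three typos:

the inverted commas (the HTML uses single quotes, the pattern expects double quotes);
the dash (the class is addr-bbs, the pattern says addr_bbs);
the space (there is no space before baths in the pattern).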

Try with bath = re.findall('''<span class=["']addr-bbs["']>''' + '(.{1,3})' + ' baths{0,1}<', str(homedata))
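
Note that (.{1,3}) accepts any one to three characters, so it would stop matching for a value longer than three characters, such as 10.5. As a hedged variation, not the pattern from this answer, the capture group can describe the number explicitly. The sample data below is made up for illustration and mirrors the markup in the question.

```python
import re

# Illustrative input: both quote styles, integer and decimal values.
homedata = """<span class='addr-bbs'>2 baths</span>
<span class="addr-bbs">3.5 baths</span>
<span class='addr-bbs'>1 bath</span>"""

# \d+(?:\.\d+)? matches an integer with an optional decimal part;
# baths? accepts both "bath" and "baths".
pattern = re.compile(
    r"""<span\s+class=["']addr-bbs["']>\s*(\d+(?:\.\d+)?)\s*baths?\s*<"""
)

baths = [float(n) for n in pattern.findall(homedata)]
print(baths)  # [2.0, 3.5, 1.0]
```

This still depends on the exact tag layout, so any extra attribute or nested tag in the span would break it; that fragility is the reason the comments recommend an HTML parser.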

edited Jul 11, 2016 at 20:04

answered Jul 11, 2016 at 19:59
