Python findall grabbing data within HTML tags using regular expression

Question

Hi All/Python'ers/RegEx'ers,

I'm working on a lab exercise to learn Python's re package. I've got the data below and want to grab only the values between the HTML tags. I tried "[^(</?\w+>)]\d+", intending to exclude all HTML tags (TBODY, TD, /TD, etc.).

It misses the first value, 1850:

<TBODY><TR><TD>1850</TD><TD>John</TD><TD>-0.339</TD><TD>-0.425</TD></TR></TBODY>

This is what I'm trying:

re.findall("[^(<\/?\w+>)]\d+", html_line)

Using "(<\/?\w+>)" as a group matches all the HTML tags. I want just the opposite, to exclude ALL HTML tags, so I tried [^(<\/?\w+>)].

Thanks in advance. N. PS: Part of the problem is that I'm not allowed to use BeautifulSoup.

Tim Biegeleisen · Accepted Answer · 2020-01-15 04:11:12Z

You should in general be using a package such as Beautiful Soup, which was designed to parse and handle HTML/XML content. Using pure regex against HTML is not ideal, but you may try the following:

import re

inp = "<TBODY><TR><TD>1850</TD><TD>-0.373</TD><TD>-0.339</TD><TD>-0.425</TD></TR></TBODY>"
# Each match is a (tag, number) tuple; keep only the number.
matches = re.findall(r'<([^>]+)>(-?\d+(?:\.\d+)?)</\1>', inp)
print([i[1] for i in matches])

This prints:

['1850', '-0.373', '-0.339', '-0.425']
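If numeric values are needed rather than strings, the same pattern can be compiled once and wrapped in a small helper. This is only a sketch: the names `CELL_NUMBER` and `extract_numbers` are made up for illustration, and it assumes every cell of interest holds a plain integer or decimal.

```python
import re

# Same pattern as above, compiled once for reuse.
CELL_NUMBER = re.compile(r'<([^>]+)>(-?\d+(?:\.\d+)?)</\1>')

def extract_numbers(html_line):
    """Return the numeric cell contents of html_line as floats."""
    # group(1) is the tag name, group(2) is the number between the tags.
    return [float(m.group(2)) for m in CELL_NUMBER.finditer(html_line)]

inp = "<TBODY><TR><TD>1850</TD><TD>-0.373</TD><TD>-0.339</TD><TD>-0.425</TD></TR></TBODY>"
numbers = extract_numbers(inp)
print(numbers)
```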

Here is an explanation of the regex used:

<([^>]+)>          match an opening HTML tag, and capture the tag label in \1
(-?\d+(?:\.\d+)?)  then match and capture a positive/negative number, with optional decimal
</\1>              match a closing HTML tag identical to the one that opened
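As for why the pattern in the question misses 1850: inside square brackets, the parentheses, `?` and `+` are literal characters, so `[^(<\/?\w+>)]` does not mean "not a tag". It matches a single character that is none of `(`, `)`, `<`, `>`, `/`, `?`, `+` or a word character. The digits of 1850 are preceded by `>`, which the class rejects, and digits are themselves word characters, so nothing can start a match there. The sketch below shows this on the same input, next to one possible alternative that anchors on the surrounding `>` and `<` with lookarounds. The variable names are illustrative only.

```python
import re

html_line = "<TBODY><TR><TD>1850</TD><TD>-0.373</TD><TD>-0.339</TD><TD>-0.425</TD></TR></TBODY>"

# The character class consumes exactly one non-tag, non-word character
# (here "-" or "."), and \d+ then grabs the digits that follow it.
attempt = re.findall(r"[^(<\/?\w+>)]\d+", html_line)
print(attempt)   # ['-0', '.373', '-0', '.339', '-0', '.425']

# Alternative: require ">" immediately before and "<" immediately after
# the number, without consuming either character.
anchored = re.findall(r"(?<=>)-?\d+(?:\.\d+)?(?=<)", html_line)
print(anchored)  # ['1850', '-0.373', '-0.339', '-0.425']
```

Unlike the back-reference version, the lookaround variant does not check that the opening and closing tags match, so it is the looser of the two.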
