Hi All/Python'ers/RegEx'ers,

I'm working on a lab exercise to learn Python's re package. I have the data below and want to grab only the data between the HTML tags. I tried "[^(</?\w+>)]\d+", i.e. exclude all HTML tags (TBODY, TD, /TD, etc.).

It misses the first value, 1850:

<TBODY><TR><TD>1850</TD><TD>John</TD><TD>-0.339</TD><TD>-0.425</TD></TR></TBODY>


This is what I'm trying:

re.findall("[^(<\/?\w+>)]\d+", html_line)

Using the group "(<\/?\w+>)" on its own matches all the HTML tags. I want just the opposite, to exclude ALL HTML tags, so I tried [^(<\/?\w+>)].

Thanks in advance. N. PS: Part of the problem is that I'm not allowed to use BeautifulSoup.

1 Answer

You should in general be using a package such as Beautiful Soup, which was designed to parse and handle HTML/XML content. Using pure regex against HTML is not ideal, but you may try the following:

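import re

inp = "<TBODY><TR><TD>1850</TD><TD>-0.373</TD><TD>-0.339</TD><TD>-0.425</TD></TR></TBODY>"
# Capture the tag name, then a signed number, then require the matching closing tag.
matches = re.findall(r'<([^>]+)>(-?\d+(?:\.\d+)?)</\1>', inp)
print([i[1] for i in matches])  # keep only the number from each (tag, number) tuple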

This prints:

['1850', '-0.373', '-0.339', '-0.425']

Here is an explanation of the regex used:

<([^>]+)>          match an opening HTML tag, and capture the tag label in \1
(-?\d+(?:\.\d+)?)  then match and capture a positive/negative number, with optional decimal
</\1>              match a closing HTML tag identical to the one that opened
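
As for why the original attempt skips 1850: inside square brackets, the parentheses, ?, and + are literal characters, so [^(</?\w+>)] is a negated character class that matches exactly one character, not a "negated group". That one character cannot be > or a digit (digits are covered by \w), so nothing can match in front of 1850. The sketch below shows this on the line from the question. It also shows one possible alternative, which is my assumption rather than part of the answer above: capturing every run of text between a > and the next <, so that non-numeric cells such as John are kept too. The variable names attempt and cell_text are only illustrative.

```python
import re

# Sample line from the question.
html_line = ("<TBODY><TR><TD>1850</TD><TD>John</TD>"
             "<TD>-0.339</TD><TD>-0.425</TD></TR></TBODY>")

# The original attempt: one character that is NOT any of ( < / ? \w + > ),
# followed by digits. Only '-' and '.' qualify as that leading character here.
attempt = re.findall(r"[^(</?\w+>)]\d+", html_line)

# Assumed alternative: capture any non-empty run of non-'<' characters
# that sits between a '>' and the next '<'.
cell_text = re.findall(r">([^<]+)<", html_line)

print(attempt)    # ['-0', '.339', '-0', '.425']
print(cell_text)  # ['1850', 'John', '-0.339', '-0.425']
```

The alternative only works for simple, flat markup like this line. It will misbehave on attributes containing >, comments, or text that contains angle brackets, which is the usual reason for preferring a real HTML parser when one is allowed.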