
I'm trying to remove HTML tags in Python 3, along with the text between them. The snippet below doesn't give me the result I'm looking for, and all the other questions I've found on SO only cover removing the HTML tags while preserving the text inside them, which is not what I'm trying to do.

Current Code

import re
...
price = "12.00 <b>17.50</b>"
price = re.sub(r'<[^>]*>', '', price)

String

12.00 <b>17.50</b>

Expected Outcome

12.00

Current Outcome

12.00 17.50

2 Answers

You can also do it with an HTML Parser, like BeautifulSoup. The idea is to find all the tags and decompose them, then get what is left:

In [8]: from bs4 import BeautifulSoup

In [9]: price = "12.00 <b>17.50</b>"

In [10]: soup = BeautifulSoup(price, "html.parser")

In [11]: for elm in soup.find_all():
    ...:     elm.decompose()
    ...:     

In [12]: print(soup)
12.00 
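
If installing BeautifulSoup is not an option, roughly the same idea can be sketched with the standard library's html.parser: track how deeply nested the parser currently is and keep only the text seen at depth zero. The names TopLevelText, text_outside_tags and VOID_TAGS are made up for this illustration, and the list of void tags is an assumption that is far from complete, so treat this as a starting point rather than a robust solution.

```python
from html.parser import HTMLParser

# Assumption: a small, incomplete set of tags that never have a closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TopLevelText(HTMLParser):
    """Collect only the text that is not inside any element."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of currently open elements
        self.chunks = []  # text seen while depth == 0

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        if self.depth == 0:
            self.chunks.append(data)


def text_outside_tags(markup):
    """Return the text of `markup` that lies outside every tag, stripped."""
    parser = TopLevelText()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.chunks).strip()


print(text_outside_tags("12.00 <b>17.50</b>"))
```

Unlike the BeautifulSoup session above, this version also strips the trailing space left behind after the removed element.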

There is also a famous topic explaining why you should not process HTML with regular expressions.

A possible solution is to go tag by tag. For example, to remove every <b> element together with everything inside it:

price = re.sub(r"<b\b[^>]*>.*?</b>", "", price)
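
The per-tag pattern can be generalized with a backreference so that any opening tag is matched together with its own closing tag. This is only a sketch: the names TAG_WITH_CONTENT and strip_tagged_text are invented here, and it assumes simple, well-formed markup with no nested elements of the same name, which is exactly where regex-based approaches tend to break down.

```python
import re

# <name ...> ... </name>, where \1 forces the closing name to match the opening one.
# DOTALL lets the content span multiple lines; IGNORECASE tolerates <B>...</b>.
TAG_WITH_CONTENT = re.compile(
    r"<(\w+)\b[^>]*>.*?</\1\s*>",
    re.DOTALL | re.IGNORECASE,
)


def strip_tagged_text(text):
    """Remove every element together with its contents, then trim whitespace."""
    return TAG_WITH_CONTENT.sub("", text).strip()


price = "12.00 <b>17.50</b>"
print(strip_tagged_text(price))  # 12.00
```

For anything beyond short fragments like the price string, a real parser is the safer choice.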
