Python: Remove HTML Tags & text in between HTML Tags

Question

I'm trying to remove HTML tags (Python 3) together with the text between them. The snippet below doesn't give the result I'm looking for, and the other questions I've found on SO only cover removing the tags while preserving the text inside them, which is not what I want.

Current Code

import re
...
price="12.00 <b>17.50</b>"
price=re.sub('<[^>]*>', '', price)

String

12.00 <b>17.50</b>

Expected Outcome

12.00

Current Outcome

12.00 17.50

Community · Accepted Answer · 2017-05-23 12:02:21Z

6

You can also do it with an HTML Parser, like BeautifulSoup. The idea is to find all the tags and decompose them, then get what is left:

In [8]: from bs4 import BeautifulSoup

In [9]: price = "12.00 <b>17.50</b>"

In [10]: soup = BeautifulSoup(price, "html.parser")

In [11]: for elm in soup.find_all():
    ...:     elm.decompose()
    ...:     

In [12]: print(soup)
12.00

And, here is a famous topic explaining why you should not process HTML with regular expressions:

RegEx match open tags except XHTML self-contained tags
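If installing BeautifulSoup is not an option, roughly the same idea can be sketched with the standard library's html.parser module instead. This is a swapped-in alternative, not the approach from the answer above: it keeps only the text seen while no element is open. The names OutsideTextCollector, text_outside_tags and VOID_TAGS are made up for this illustration, and the void-tag list is an assumption that is deliberately incomplete.

```python
from html.parser import HTMLParser

# Assumption: a partial list of tags that never get a closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class OutsideTextCollector(HTMLParser):
    """Collect only the text that is not inside any element."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # number of currently open (non-void) elements
        self.parts = []  # text chunks seen while depth == 0

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)


def text_outside_tags(html):
    """Return the text of `html` that lies outside every element."""
    parser = OutsideTextCollector()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts).strip()


print(text_outside_tags("12.00 <b>17.50</b>"))  # 12.00
```

Because it counts open elements instead of matching tag names, nested markup such as `<i>a <b>b</b></i>` is dropped as a whole. Badly unbalanced HTML can still confuse the counter, so treat this as a sketch rather than a robust sanitizer.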

edited May 23, 2017 at 12:02

answered May 2, 2017 at 11:25

alecxe

maiky_forrester · Accepted Answer · 2017-05-02 11:30:54Z

0

One possible solution is to handle one tag at a time; for example, to remove everything inside <b></b>:

price=re.sub("<[b][^>]*>(.+?)</[b]>", '', price)
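Note that `<[b][^>]*>` also matches any other tag whose name starts with "b" (for example `<br>`), and that the substitution leaves the trailing space behind. A possible generalisation, offered only as a sketch, uses a backreference so that any opening tag is paired with a closing tag of the same name. ELEMENT_RE and remove_elements are names invented for this example, and the pattern assumes simple, non-nested markup.

```python
import re

# <name ...> ... </name>, where \1 forces the closing name to equal the opening one.
# re.DOTALL lets the element content span several lines.
ELEMENT_RE = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^>]*>.*?</\1\s*>", re.DOTALL)


def remove_elements(text):
    """Remove each element together with its content, then trim whitespace."""
    return ELEMENT_RE.sub("", text).strip()


price = "12.00 <b>17.50</b>"
print(remove_elements(price))  # 12.00
```

This still breaks when an element contains another element of the same name (the non-greedy match stops at the first closing tag), which is one of the reasons a parser is usually the safer choice.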

answered May 2, 2017 at 11:30

maiky_forrester
