Python regex - stripping out HTML tags and formatting characters from inner HTML

Question

I'm dealing with single HTML strings like this:

>>> s = 'u><br/>\n                                    Some text <br/><br/><u'

where I've got meaningful text embedded inside broken HTML or incomplete HTML tags. I need to extract only that inner text, and ignore the broken HTML. How can I do this? I'm using

>>> re.search(r'(.>)(<.>)(.>)', s)
>>>

but this returns None.

Pierce Darragh · Accepted Answer · 2016-12-09 17:28:13Z


If I understand you right, you're looking to take this input:

u><br/>\n                                    Some text <br/><br/><u

And receive this output:

\n                                    Some text

This is simple enough if you only care about what comes between the two inward-pointing brackets. The pattern needs three parts:

A right-bracket > (so we know where to begin)
The content, i.e. one or more characters that are not a left-bracket, captured as ([^<]+)
A left-bracket < (so we know where to end)

In code:

>>> s = 'u><br/>\n                                    Some text <br/><br/><u'
>>> re.search(r'>([^<]+)<', s)
<_sre.SRE_Match object; span=(6, 55), match='>\n                                    Some text >

(The captured group can be accessed via .group(1).)
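
As a minimal sketch of that last step (the variable names `inner` and `cleaned` are only illustrative): re.search returns None when nothing matches, so it is worth guarding before calling .group(1), and .strip() can drop the leading newline and padding if you don't need them.

```python
import re

s = 'u><br/>\n                                    Some text <br/><br/><u'

match = re.search(r'>([^<]+)<', s)

# re.search returns None when there is no match, so guard before .group()
inner = match.group(1) if match else ''

# Optional: remove the leading newline/indentation and the trailing space
cleaned = inner.strip()

print(repr(cleaned))  # 'Some text'
```

Whether to strip is a judgment call; keep `inner` as-is if the original whitespace matters to you.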

Additionally, you may want to use re.findall if you expect there to be multiple matches per line:

>>> re.findall(r'>([^<]+)<', s)
['\n                                    Some text ']

EDIT: To address the comment: If you have multiple matches and you want to connect them into a single string (effectively removing all HTML-like tag things), do:

>>> s = 'nbsp;<br><br>Some text.<br>Some \n more text.<br'
>>> ' '.join(re.findall(r'>([^<]+)<', s))
'Some text. Some \n more text.'
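
If you need this in more than one place, the same idea could be wrapped up roughly like this. This is only a sketch: `join_inner_text` and `INNER_TEXT` are names made up for illustration, not anything from a library.

```python
import re

# Same pattern as above: text that sits between a '>' and the next '<'
INNER_TEXT = re.compile(r'>([^<]+)<')


def join_inner_text(fragment, sep=' '):
    """Join every run of text found between '>' and '<' (hypothetical helper)."""
    return sep.join(INNER_TEXT.findall(fragment))


s = 'nbsp;<br><br>Some text.<br>Some \n more text.<br'
print(repr(join_inner_text(s)))  # 'Some text. Some \n more text.'
```

Note that anything not enclosed between a > and a < is dropped, which is why the leading nbsp; disappears here. That also means a string with no brackets at all comes back empty.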

edited Dec 9, 2016 at 17:28

answered Dec 9, 2016 at 16:23

srm Over a year ago

Thanks, but what if s is something like nbsp;<br><br>Some text.<br>Some \n more text.<br, and I need to strip out all the HTML and formatting to just get Some text. Some \n more text.?
