Get text in nested HTML tags with regex in Python

Question

I have some text with HTML tags:

<p><b>Name and LastName</b><br />
Work Title<br /><span class="text-spacer"></span>
</p>

I want to get the text in this format:

Name and LastName - Work Title

This is my Python code, but it doesn't work:

import re

text = '''<p><b>Name and LastName</b><br />
    Work Title<br /><span class="text-spacer"></span>
    </p>'''
my_text = re.sub(r'</b><br />', ' - ', text)

Do not try to parse HTML with regex. Use something like BeautifulSoup.
– roganjosh, Commented Oct 24, 2016 at 14:43

alecxe · Accepted Answer · 2016-10-24 14:45:08Z

3

I'd use a specialized tool for the job - an HTML Parser, like BeautifulSoup:

In [1]: from bs4 import BeautifulSoup

In [2]: data = """<p><b>Name and LastName</b><br />
    ...: Work Title<br /><span class="text-spacer"></span>
    ...: </p>"""

In [3]: soup = BeautifulSoup(data, "html.parser")

In [4]: soup.p.get_text(separator=" - ", strip=True)
Out[4]: u'Name and LastName - Work Title'

Note the use of the separator argument: it lets you provide a custom separator between the child nodes' text when getting the text of the parent, a neat feature that fits your use case nicely.


1 Comment

git-e Over a year ago

And if I have several items, this code returns only the first one... @alecxe
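
On the point raised in this comment: soup.p only refers to the first <p> element in the document, which is probably why only one item comes back. A minimal sketch of one way to handle several items, assuming each item sits in its own <p> element (the second record below is made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical input: two <p> blocks with the same layout as in the question.
data = """<p><b>Name and LastName</b><br />
Work Title<br /><span class="text-spacer"></span>
</p>
<p><b>Other Name</b><br />
Other Title<br /><span class="text-spacer"></span>
</p>"""

soup = BeautifulSoup(data, "html.parser")

# find_all("p") returns every <p> element, whereas soup.p is only the first one.
lines = [p.get_text(separator=" - ", strip=True) for p in soup.find_all("p")]

for line in lines:
    print(line)
```

If the items are wrapped in a different element, the name passed to find_all would need to change accordingly.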
