Say I have a string:

"<blockquote>Quote</blockquote><br />text <h3>This is a title</h3>"

Expected Output:

["<blockquote>Quote</blockquote><br />", "text", "<h3>This is a title</h3>"]

I need both the opening and closing tags to be included in the same item, as above.

I've tried: re.split("<*>*</*>", s)

I'm quite new to regex, so any help is appreciated.

  • Possible duplicate of RegEx match open tags except XHTML self-contained tags Commented Jul 12, 2018 at 20:02
  • Don't use regex to parse HTML :) stackoverflow.com/questions/1732348/… Commented Jul 12, 2018 at 20:19
  • "Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide." Commented Jul 12, 2018 at 20:22
  • "Even Jon Skeet cannot parse HTML using regular expressions." Commented Jul 12, 2018 at 20:23
  • A more explanatory article on "why not use Regex": blog.codinghorror.com/parsing-html-the-cthulhu-way Most importantly: Parsing HTML is a solved problem. You do not need to solve it. You just need to be lazy (and use another library) Commented Jul 12, 2018 at 20:27

1 Answer

You can use re.findall to do this.

import re
s = "<blockquote>Quote</blockquote><br />text <h3>This is a title</h3>"
re.findall(r'<[^>]*>.*?</[^>]*>(?:<[^>]*/>)?|[^<>]+', s)
# ['<blockquote>Quote</blockquote><br />', 'text ', '<h3>This is a title</h3>']
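To make the pattern easier to read, here is the same regex spelled out with re.VERBOSE, plus a small wrapper that strips whitespace so that 'text ' becomes 'text' as in the expected output. The names CHUNK and split_chunks are hypothetical, chosen only for this sketch. It assumes flat markup like the sample string, with no nested elements.

```python
import re

# Same pattern as above, written with re.VERBOSE so each part can carry a comment.
CHUNK = re.compile(r"""
    <[^>]*>          # an opening tag, e.g. <blockquote>
    .*?              # its content, matched as lazily as possible
    </[^>]*>         # the first closing tag that follows
    (?:<[^>]*/>)?    # optionally one self-closing tag right after, e.g. <br />
  |
    [^<>]+           # otherwise, a run of plain text between tags
""", re.VERBOSE)


def split_chunks(html):
    """Hypothetical helper: return the matched chunks, stripped, dropping empty ones."""
    parts = (match.strip() for match in CHUNK.findall(html))
    return [part for part in parts if part]


s = "<blockquote>Quote</blockquote><br />text <h3>This is a title</h3>"
print(split_chunks(s))
# ['<blockquote>Quote</blockquote><br />', 'text', '<h3>This is a title</h3>']
```

Note that the pattern never checks that the closing tag name matches the opening one, so nested or malformed markup will likely be split in the wrong places. That is one concrete reason to prefer a real parser.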

However, avoid parsing HTML directly with regex where you can, and consider using a parser such as BeautifulSoup instead:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(s, "html.parser")
>>> soup.find_all()
[<blockquote>Quote</blockquote>, <br/>, <h3>This is a title</h3>]
>>> soup.find_all()[0].text
'Quote'
>>> list(soup.strings)
['Quote', 'text ', 'This is a title']
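The BeautifulSoup calls above return elements and strings separately, not the grouping asked for in the question. As a rough sketch of how a parser could produce that grouping, here is a version built on the standard library's html.parser instead of BeautifulSoup (a swap made only so the example has no third-party dependency). The class name TopLevelChunks and the rule "attach a self-closing tag to the element just before it" are assumptions taken from the expected output, not an established API.

```python
from html.parser import HTMLParser


class TopLevelChunks(HTMLParser):
    """Hypothetical helper: collect top-level elements and text runs as strings."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._buf = []                  # pieces of the element currently open
        self._depth = 0                 # how many elements are currently open
        self._last_was_element = False  # was the previous chunk an element?

    def handle_starttag(self, tag, attrs):
        # Assumes every non-self-closing tag is closed later; a bare <br> would not be.
        self._buf.append(self.get_starttag_text())
        self._depth += 1

    def handle_endtag(self, tag):
        if self._depth == 0:
            return  # ignore a stray closing tag
        self._buf.append(f"</{tag}>")
        self._depth -= 1
        if self._depth == 0:
            self.chunks.append("".join(self._buf))
            self._buf = []
            self._last_was_element = True

    def handle_startendtag(self, tag, attrs):
        raw = self.get_starttag_text()
        if self._depth > 0:
            self._buf.append(raw)
        elif self._last_was_element:
            self.chunks[-1] += raw      # e.g. glue <br /> onto the preceding element
        else:
            self.chunks.append(raw)
            self._last_was_element = True

    def handle_data(self, data):
        if self._depth > 0:
            self._buf.append(data)
        elif data.strip():
            self.chunks.append(data.strip())
            self._last_was_element = False


s = "<blockquote>Quote</blockquote><br />text <h3>This is a title</h3>"
parser = TopLevelChunks()
parser.feed(s)
parser.close()  # flush any trailing text
print(parser.chunks)
# ['<blockquote>Quote</blockquote><br />', 'text', '<h3>This is a title</h3>']
```

Unlike the regex, this tracks nesting depth, so an element containing other elements stays in one chunk. Character references such as &amp; are decoded in text by default, so the chunks are not guaranteed to be byte-for-byte copies of the input.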
3 Comments

Could you direct me to a Beautiful Soup solution?
Updated the answer
Works. Thank you!
