0

I want to write a regular expression that extracts the text of all HTML elements with class important from this string:

text = """Lorem ipsum dolor <b>sit</b> amet, <b class="important">consectetur adipiscing</b> elit, \
sed do eiusmod <span id="note">tempor incididunt ut</span> <div>labore <strong class="important">\
et dolore magna</strong> aliqua.</div> Ut enim ad minim veniam, quis nostrud exercitation ullamco."""

Do you have an idea for a simple, short solution? Thank you!

3
  • 1
    Take a look at this answer and How to Ask. Commented May 6, 2020 at 17:04
  • 1
    Use Beautiful Soup to parse the HTML and extract the required content. Its documentation is really easy to understand, and the library itself is easy to use. Commented May 6, 2020 at 17:26
  • Don't use regex for HTML parsing. It WILL give you headaches. Comment #1 above hints at why it's a bad idea; comment #2 above suggests a solution you should definitely consider. Commented May 6, 2020 at 17:57
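
For reference, the parser-based approach suggested in the comments could look roughly like the sketch below. It uses the standard library's html.parser instead of Beautiful Soup, only to avoid a third-party dependency. The class name ImportantTextParser and the shortened sample string are made up for illustration, and the sketch assumes every opened tag is closed (a bare <br> would throw off the stack).

```python
from html.parser import HTMLParser


class ImportantTextParser(HTMLParser):
    """Collect the text inside elements whose class list contains "important"."""

    def __init__(self):
        super().__init__()
        self.stack = []    # one flag per open element: True if it is "important"
        self.buffer = []   # text pieces seen inside the current important element
        self.results = []  # finished strings, one per important element

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        classes = (dict(attrs).get("class") or "").split()
        self.stack.append("important" in classes)

    def handle_endtag(self, tag):
        if not self.stack:
            return
        was_important = self.stack.pop()
        # Emit once the outermost important element has been closed.
        if was_important and not any(self.stack):
            self.results.append("".join(self.buffer))
            self.buffer = []

    def handle_data(self, data):
        # Keep text only while at least one important element is open.
        if any(self.stack):
            self.buffer.append(data)


# Shortened excerpt of the question's string (illustrative only).
sample = (
    'dolor <b>sit</b> amet, <b class="important">consectetur adipiscing</b> elit, '
    '<div>labore <strong class="important">et dolore magna</strong> aliqua.</div>'
)

parser = ImportantTextParser()
parser.feed(sample)
parser.close()
print(parser.results)
```

Unlike a regex, this also copes with other attributes, several class names on one element, and tags nested inside the important element.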

1 Answer

2

Try this regex pattern:

<[^>]*class="important"[^>]*>[^>]*<\/[^>]*>

If you want to remove the tags, use a regex substitution with this pattern:

<\/{0,1}[^>]*>

If you want to build your own patterns, try https://regexr.com. It's a great site that highlights the matches, which makes things easier. Please mark this as the answer if it helps you.
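
As a rough illustration, the first pattern can also return just the inner text in one step if a capture group is put around the content part. This is a minimal sketch, not a general solution: the capture group is an addition to the pattern above, the question's string is rebuilt here with implicit concatenation, and it assumes double-quoted attributes and no tags nested inside the important element.

```python
import re

# The string from the question, joined into one line.
text = (
    'Lorem ipsum dolor <b>sit</b> amet, <b class="important">consectetur adipiscing</b> elit, '
    'sed do eiusmod <span id="note">tempor incididunt ut</span> <div>labore <strong class="important">'
    'et dolore magna</strong> aliqua.</div> Ut enim ad minim veniam, quis nostrud exercitation ullamco.'
)

# Same pattern as in the answer, with parentheses around the element content.
pattern = re.compile(r'<[^>]*class="important"[^>]*>([^>]*)<\/[^>]*>')

# With a single capture group, findall returns only the captured text.
important = pattern.findall(text)
print(important)
```

If the important element can contain other tags, or the class attribute can hold several names, this pattern will miss matches, which is the main reason the comments recommend a parser.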


2 Comments

Use re.sub(pattern, new_string, input_string): pattern is your regex pattern, new_string is what you want to replace the matches with, and input_string should be your text variable in this case.
If you want to remove the HTML tags, try: output = re.sub(r"<\/{0,1}[^>]*>", "", input_string). Here we replace each match with an empty string to remove it.
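
Putting the two steps from the answer and these comments together might look like the sketch below: first find the whole important elements, then strip the tags from each match. The variable names and the short sample string are placeholders chosen for illustration.

```python
import re

# Short stand-in for the question's text (illustrative only).
input_string = (
    'amet, <b class="important">consectetur adipiscing</b> elit, '
    '<div>labore <strong class="important">et dolore magna</strong> aliqua.</div>'
)

# Step 1: match each element with class="important", tags included.
elements = re.findall(r'<[^>]*class="important"[^>]*>[^>]*<\/[^>]*>', input_string)

# Step 2: remove opening and closing tags from every match.
output = [re.sub(r"<\/{0,1}[^>]*>", "", element) for element in elements]
print(output)
```

Note that calling re.sub on the whole input_string instead would strip every tag and keep all of the text, not just the important parts.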
