Remove html tags using Regex

Question

I'm trying to strip the HTML tags from some text. It works to an extent, but not all of the tags are removed. The tags mentioned below are still there:

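import lxml.html

def remove_tags(content):
    # Parse the markup, dropping comments and whitespace-only text nodes
    parser = lxml.html.HTMLParser(remove_comments=True,
                                  remove_blank_text=True)
    document = lxml.html.document_fromstring(content, parser=parser)
    return document.text_content()

print('NOT DEALT WITH:')
# not_dealt_with_list (defined elsewhere) holds the bodies that still contain markup
for body in not_dealt_with_list:
    # Earlier regex attempt, kept for reference:
    # p = re.compile(r'<.*?[\\t\\n\\r\\s]*?.*?>')
    # print(p.sub('', body))
    print(remove_tags(body))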

Andreas · Accepted Answer · 2019-07-17 08:28:33Z

It looks like what you're trying to remove is embedded in an HTML comment (because it doesn't look like HTML there). HTML comments start with <!-- and end with -->, and that's what you have to search for.

Try this regex, which matches everything inside a comment, including across multiple lines, so that you can replace it afterwards:

<!--(.|\n)*?-->

Let me know how it works out!
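
As a rough sketch of how the comment pattern above could be applied in Python: the version below uses re.DOTALL so that "." also matches newlines, which should behave the same as the (.|\n) alternation. The names COMMENT_RE, TAG_RE, strip_comments_and_tags and the sample string are made up for illustration, and the naive tag pattern is an assumption, not something taken from the question.

```python
import re

# Non-greedy match of one HTML comment; re.DOTALL lets "." cross newlines,
# so this is intended as an equivalent of <!--(.|\n)*?-->
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)

# Naive tag pattern (an assumption for illustration only)
TAG_RE = re.compile(r"<[^>]+>")


def strip_comments_and_tags(html):
    """Remove whole comments first, then any remaining tags."""
    without_comments = COMMENT_RE.sub("", html)
    return TAG_RE.sub("", without_comments)


# Hypothetical input: markup hidden inside a multi-line comment
sample = "<p>Hello</p><!-- <o:p>\nhidden markup</o:p> --><b>world</b>"
print(strip_comments_and_tags(sample))  # prints "Helloworld"
```

Removing the comments before the tags matters here: if the tags were stripped first, the text inside the comment would be left behind in the output.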
