I'm trying to strip the HTML tags from some text. It works to an extent, but some tags are still left in the output of the code below:

import lxml.html

def remove_tags(content):
    # Parse the markup, dropping comments and whitespace-only text nodes
    parser = lxml.html.HTMLParser(remove_comments=True,
                                  remove_blank_text=True)
    document = lxml.html.document_fromstring(content, parser)
    # Return only the text, without any tags
    return document.text_content()

print('NOT DEALT WITH:')
for body in not_dealt_with_list:
    #p = re.compile(r'<.*?[\\t\\n\\r\\s]*?.*?>')
    print(remove_tags(body))
    #print(p.sub('', body))
    #body = re.sub()

1 Answer

It looks like what you're trying to remove is embedded in an HTML comment (it doesn't look like HTML there). HTML comments start with <!-- and end with -->, and that's what you have to search for.

Try this regex to match everything inside a comment, across multiple lines, so you can replace it afterwards:

<!--(.|\n)*?-->
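
As a rough sketch of how that pattern could be applied in Python with the standard re module: the names COMMENT_RE, strip_comments and sample are made up for illustration and are not from the question's code.

```python
import re

# Same pattern as above; (.|\n) lets the lazy match span line breaks.
COMMENT_RE = re.compile(r'<!--(.|\n)*?-->')

def strip_comments(html):
    """Remove every <!-- ... --> block, including multi-line ones."""
    return COMMENT_RE.sub('', html)

# Hypothetical input, only to show the effect.
sample = '<p>keep</p><!-- drop\nthis --><p>also keep</p>'
print(strip_comments(sample))
```

You could run this on each body before passing it to remove_tags. Writing the pattern as <!--.*?--> and compiling it with re.DOTALL should match the same text.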

Let me know how it works out!
