How to ignore HTML comment tags in a regex with Python

Question

I am replacing special characters with ASCII codes while ignoring HTML tags, using the regex below:

text_list = re.findall(r'>([\S\s]*?)<', html)

This skips all HTML tags as intended, but it does not skip the closing delimiter of HTML comments, "-->".

Any help is appreciated. What should I change in the regex?

Attached screenshot for your reference.

Why are you using regular expressions with HTML? Use an HTML parser such as BeautifulSoup. — user5386938
– user5386938, Commented Jun 8, 2021 at 19:50
We cannot use BeautifulSoup because I need to return the modified file. — VJ_Bravo
– VJ_Bravo, Commented Jun 9, 2021 at 5:03
You speak of "ignoring" and "replacing" but it doesn't fully explain what you're doing with your regex. You need to show more code or explain it better. — Patrick Parker
– Patrick Parker, Commented Jun 9, 2021 at 5:30
Could you please give me an example where I can replace special characters with ASCII codes with the help of BeautifulSoup? That would be really helpful. — VJ_Bravo
– VJ_Bravo, Commented Jun 10, 2021 at 14:57

Dhyn amicable · Accepted Answer · 2021-06-19 14:40:00Z

1

Please try passing multiple encoding parameters when you read the file.
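
One possible reading of this suggestion is to try several encodings in turn when opening the file. The sketch below works under that assumption; the helper name `read_text_with_fallback` and the default encoding list are hypothetical and not part of the answer.

```python
def read_text_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Return (text, encoding) for the first encoding that decodes the whole file.

    Hypothetical helper: the encoding list is an assumption, adjust it to your data.
    """
    if not encodings:
        raise ValueError("at least one encoding is required")
    last_error = None
    for enc in encodings:
        try:
            # newline="" keeps the original line endings, which matters
            # when the file is written back after modification.
            with open(path, encoding=enc, newline="") as fh:
                return fh.read(), enc
        except UnicodeDecodeError as exc:
            last_error = exc
    raise last_error
```

Returning the encoding alongside the text lets the caller write the modified file back with the same encoding.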



Wiktor Stribiżew · Accepted Answer · 2021-06-08 19:40:34Z

1

You may match the comments and discard them using re.findall:

text_list = list(filter(None, re.findall(r'(?s)<!--.*?-->|>(.*?)<', html)))
# Or, a bit more efficient:
text_list = list(filter(None, re.findall(r'<!--[^-]*(?:-(?!->)[^-]*)*-->|>([^<]*)<', html)))

See this regex demo (and the second one).

The regex matches substrings between <!-- and --> without capturing them, and substrings between > and <, capturing the text between the two latter delimiters into Group 1. re.findall returns only the captures when the pattern contains a capturing group, so a comment match produces an empty string, which filter(None, ...) then removes.
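
A minimal sketch of that re.findall behavior, using a made-up input string (`sample` is only for illustration): the comment match contributes an empty string because Group 1 did not take part in it.

```python
import re

sample = "<!-- note --><b>bold</b>"

# With one capturing group, findall returns the group's text for every match;
# a match made by the comment alternative yields '' for the group.
raw = re.findall(r'(?s)<!--.*?-->|>(.*?)<', sample)
print(raw)       # ['', 'bold']

# filter(None, ...) drops the empty strings left behind by comment matches.
cleaned = list(filter(None, raw))
print(cleaned)   # ['bold']
```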

See the Python demo:

import re
html = "<a href='link.html'>URL</a>Some text <!-- Comment --><p>Par here</p>More text"
text_list = list(filter(None, re.findall(r'(?s)<!--.*?-->|>(.*?)<', html)))
print(text_list)
# => ['URL', 'Some text ', 'Par here']
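
One caveat, shown here as a sketch rather than a confirmed diagnosis of the asker's case (their input is not shown): the `>(.*?)<` alternative consumes the `<` that ends each text run. When text sits directly before a comment, that `<` is the start of `<!--`, so the comment alternative never gets a chance to match and any tags inside the comment leak through. Replacing the trailing `<` with a lookahead `(?=<)` is one way to avoid this; the input string and variable names below are made up for illustration.

```python
import re

# Text directly before a comment that itself contains a tag.
html = ("<a href='link.html'>URL</a>Some text "
        "<!-- <b>hidden</b> --><p>Par here</p>More text")

consuming = r'(?s)<!--.*?-->|>(.*?)<'        # pattern from the answer
lookahead = r'(?s)<!--.*?-->|>([^<]*)(?=<)'  # variant that does not consume '<'

consuming_result = list(filter(None, re.findall(consuming, html)))
lookahead_result = list(filter(None, re.findall(lookahead, html)))

print(consuming_result)
# => ['URL', 'Some text ', 'hidden', ' -->', 'Par here']
print(lookahead_result)
# => ['URL', 'Some text ', 'Par here']
```

As in the demo above, trailing text that is not followed by another tag ("More text") is not returned by either pattern.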


2 Comments

VJ_Bravo Over a year ago

Thanks for your reply. I have tried it, but it is not working; it gives the same output.

Wiktor Stribiżew Over a year ago

@VishalJ That means you either have a different input or you are not actually using my solution. Please use ideone.com/hVQi1F to show me what code and input you have if you need more help (click fork, edit the code, run and then share the new link with me).
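
Since the question is about returning a modified file rather than extracting text, a re.sub callback may be a closer fit than re.findall. The sketch below copies tags and comments through unchanged and rewrites only the text between them. The replacement rule (non-ASCII characters become numeric character references such as `&#233;`) is an assumption about what "ASCII code" means in the question, and the names `TOKEN`, `to_char_refs`, and `replace_in_text_nodes` are hypothetical.

```python
import re

# Group 1: a whole comment or a whole tag (copied through unchanged).
# Group 2: a run of text between tags (the only part that gets rewritten).
TOKEN = re.compile(r'(?s)(<!--.*?-->|<[^>]*>)|([^<]+)')


def to_char_refs(text):
    # Assumed rule: keep ASCII as-is, turn everything else into &#NNN;.
    return ''.join(c if ord(c) < 128 else '&#{};'.format(ord(c)) for c in text)


def replace_in_text_nodes(html):
    def repl(match):
        if match.group(1) is not None:
            return match.group(1)          # tag or comment: leave untouched
        return to_char_refs(match.group(2))  # text: apply the replacement
    return TOKEN.sub(repl, html)
```

Because re.sub leaves unmatched characters in place, a stray `<` that does not start a tag is passed through as-is. Like any regex approach to HTML, this does not handle `>` inside attribute values or the raw contents of script and style elements.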
