I am replacing special characters with ASCII codes while ignoring HTML tags, using the regex below:

text_list = re.findall(r'>([\S\s]*?)<', html)

This ignores all HTML tags as intended, but it does not ignore the HTML comment closing tag "-->".

What should I change in the regex? Any help is appreciated.

I have attached screenshots for reference; in the second one you can see the ASCII code replacement.

  • Why are you using regular expressions with HTML? Use an HTML parser such as BeautifulSoup. Commented Jun 8, 2021 at 19:50
  • We cannot use BeautifulSoup because I need to return the modified file. Commented Jun 9, 2021 at 5:03
  • You speak of "ignoring" and "replacing" but it doesn't fully explain what you're doing with your regex. You need to show more code or explain it better. Commented Jun 9, 2021 at 5:30
  • Who says you cannot modify a file using BeautifulSoup? Commented Jun 9, 2021 at 6:08
  • Could you please give me an example where I replace special characters with ASCII codes with the help of BeautifulSoup? It would be really helpful. Commented Jun 10, 2021 at 14:57

2 Answers

1

When you read the file, try passing an explicit encoding parameter, and try more than one encoding if the first does not work.
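A minimal sketch of that idea, assuming the file is either UTF-8 or Windows-1252. The helper name `read_text_with_fallback` and the list of candidate encodings are hypothetical, not something from the question; adjust them to whatever your files actually use.

```python
def read_text_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate encoding that decodes cleanly.

    The candidate list is an assumption for illustration only.
    """
    for enc in encodings:
        try:
            # open() accepts an explicit encoding; decoding errors surface on read().
            with open(path, encoding=enc) as fh:
                return fh.read(), enc
        except UnicodeDecodeError:
            # This encoding does not fit the bytes; try the next candidate.
            continue
    raise ValueError(f"none of {encodings!r} could decode {path!r}")
```

Order matters here: put the strictest encoding (usually UTF-8) first, because single-byte encodings such as cp1252 accept almost any byte sequence and would mask a wrong guess.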

1

You may match the comments and discard them using re.findall:

text_list = list(filter(None, re.findall(r'(?s)<!--.*?-->|>(.*?)<', html)))
# Or, a bit more efficient:
text_list = list(filter(None, re.findall(r'<!--[^-]*(?:-(?!->)[^-]*)*-->|>([^<]*)<', html)))

See this regex demo (and the second one).

The regex matches substrings between <!-- and -->, and also matches substrings between > and <, capturing the text between the latter two delimiters into Group 1. re.findall returns only the captures when the pattern contains a capturing group, so a comment match yields an empty string, which filter(None, ...) then removes.
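To make the role of the capturing group concrete, here is a small illustration with a made-up input in which the comment comes first. It prints what re.findall returns before and after filtering.

```python
import re

# Made-up input for illustration: a comment followed by one element.
html = "<!-- note --><b>Hi</b>"
pattern = r'(?s)<!--.*?-->|>(.*?)<'

# The comment alternative matches, but Group 1 does not take part in it,
# so re.findall reports an empty string for that match.
raw = re.findall(pattern, html)
print(raw)    # => ['', 'Hi']

# filter(None, ...) drops the empty strings left behind by comment matches.
texts = list(filter(None, raw))
print(texts)  # => ['Hi']
```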

See the Python demo:

import re
html = "<a href='link.html'>URL</a>Some text <!-- Comment --><p>Par here</p>More text"
text_list = list(filter(None, re.findall(r'(?s)<!--.*?-->|>(.*?)<', html)))
print(text_list)
# => ['URL', 'Some text ', 'Par here']
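One edge case may be worth checking against your own input. The text alternative consumes the closing <, so when a comment directly follows a tag or text, its opening < may already be used up and the comment alternative never gets a chance to match. If the comment itself contains markup, pieces of it (including "-->") can then leak into the result. The sketch below uses a made-up input and a variant that is not from the patterns above: it asserts the < with a lookahead instead of consuming it.

```python
import re

# Made-up input: a comment that itself contains a tag.
html = "<p>a</p><!-- <b>hidden</b> --><p>c</p>"

# Pattern as given above: each text match consumes the "<" that ends it.
consuming = r'(?s)<!--.*?-->|>(.*?)<'
# Variant (an assumption, not the original pattern): leave the "<" in place
# with a lookahead so the comment alternative can still start on it.
lookahead = r'(?s)<!--.*?-->|>(.*?)(?=<)'

with_consuming = list(filter(None, re.findall(consuming, html)))
with_lookahead = list(filter(None, re.findall(lookahead, html)))

print(with_consuming)  # => ['a', 'hidden', ' -->', 'c']
print(with_lookahead)  # => ['a', 'c']
```

For the demo string above, both patterns give the same list, so the difference only shows up when comments contain > or tags.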

2 Comments

Thanks for your reply. I tried it, but it is not working; it gives the same output.
@VishalJ That means you either have a different input or you are not actually using my solution. Please use ideone.com/hVQi1F to show me what code and input you have if you need more help (click fork, edit the code, run and then share the new link with me).
