Remove unicode HTML tags in Python

Question

I have a string from which I would like to remove the HTML tags.

"overview":"\u003cp style=\"margin: 0px 0px 20px; padding: 0px; line-height: 20px; outline: none !important; min-height: 1em; color: #333333; font-family: Arial, Helvetica, sans-serif, emoji; font-size: 14px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-size-adjust: auto; text-align: center;\"\u003e\u003cspan style=\"font-family: arial, helvetica, sans-serif;\"\u003e\u003cstrong\u003e\u003cspan style=\"font-size: 18pt; outline: none !important;\"\u003eWTS/VDI macOS\u003c/span\u003e\u003c/strong\u003e\u003c/span\u003e\u003c/p\u003e\n\u003chr\u003e\n\u003cp\u003e\u003cspan ......

I would like to just have

"overview":"WTS/VDI macOS.....

I tried BeautifulSoup and Bleach, but both only recognize tags written with literal '<' and '>' characters. Is there a library or function that removes these escaped tags for me? Or should I convert the unicode escapes first and then strip the tags myself?

text-overflow + white-space? Is this something you tried via CSS, or does it need to be reduced in the code itself, not only on screen?
– G-Cyrillus, Commented Oct 29, 2022 at 9:43

Steve Barnes · Accepted Answer · 2022-10-29 10:22:48Z

Your string presumably comes from somewhere else. If you encode it as UTF-8, it becomes:

'"overview":"<p style="margin: 0px 0px 20px; padding: 0px; line-height: 20px; outline: none !important; min-height: 1em; color: #333333; font-family: Arial, Helvetica, sans-serif, emoji; font-size: 14px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-size-adjust: auto; text-align: center;">WTS/VDI macOS\n<hr>\n<span...'

which tools such as BeautifulSoup should handle.

Note that if your string is the result of a subprocess.run call with output capture turned on, then adding an encoding="utf-8" parameter should do this for you.
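
As a rough sketch of the subprocess case, assuming the child process prints HTML to stdout (the child command below is a made-up stand-in that prints a fixed snippet), passing encoding="utf-8" makes subprocess.run return str instead of bytes:

```python
import subprocess
import sys

# Hypothetical child process: it only prints a small HTML snippet.
child_code = "print('<p>WTS/VDI macOS</p>')"

result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,  # collect stdout/stderr instead of inheriting them
    encoding="utf-8",     # decode the captured bytes to str
    check=True,           # raise if the child exits with an error
)

html = result.stdout.strip()
print(type(html).__name__, html)
```

Keep in mind that encoding= only turns bytes into text. If the captured output contains literal \u003c sequences, those still have to be unescaped in a separate step.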

Vishnukk · Accepted Answer · 2022-10-29 10:39:21Z

This should do!

import os

from bs4 import BeautifulSoup

# Raw string: keeps the backslash escapes exactly as they appear in the source text
content = r'"overview":"\u003cp style=\"margin: 0px 0px 20px; padding: 0px; line-height: 20px; outline: none !important; min-height: 1em; color: #333333; font-family: Arial, Helvetica, sans-serif, emoji; font-size: 14px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-size-adjust: auto; text-align: center;\"\u003e\u003cspan style=\"font-family: arial, helvetica, sans-serif;\"\u003e\u003cstrong\u003e\u003cspan style=\"font-size: 18pt; outline: none !important;\"\u003eWTS/VDI macOS\u003c/span\u003e\u003c/strong\u003e\u003c/span\u003e\u003c/p\u003e\n\u003chr\u003e\n\u003cp\u003e\u003cspan ......'

# Decode the \uXXXX escapes into real characters, then parse the HTML
soup = BeautifulSoup(content.encode('utf-8').decode('unicode-escape'), 'html.parser')
for tag in soup():  # calling soup() is shorthand for soup.find_all()
    for attribute in ["class", "id", "name", "style"]:
        del tag[attribute]

# Keep only the non-empty lines of the extracted text
new_s = os.linesep.join([s for s in soup.text.splitlines() if s])
print(new_s)

This will give me

"overview":"WTS/VDI (Citrix) - macOS WTS (Windows Terminal Services) and VDI (Virtual Desktop Instance) provide you....

without the HTML Tags!
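
One caveat with decode('unicode-escape') is that it treats the bytes as Latin-1, so non-ASCII characters in the text can come out garbled. If the fragment is part of a JSON document (an assumption here, suggested by the "overview": key), json.loads decodes the escapes without that problem. The following is only a sketch: the sample string is a shortened stand-in for the real data, TextExtractor is a hypothetical helper, and it uses the standard library's html.parser in place of BeautifulSoup so it runs without extra installs:

```python
import json
from html.parser import HTMLParser

# Shortened stand-in for the raw text: literal backslash escapes, as they
# would appear in a JSON file or HTTP response body.
raw = r'"overview":"\u003cp style=\"margin: 0px;\"\u003e\u003cstrong\u003eWTS/VDI macOS\u003c/strong\u003e\u003c/p\u003e\n\u003chr\u003e"'

# Assumption: the fragment is one key/value pair of a JSON object, so
# wrapping it in braces makes it parseable.
record = json.loads("{" + raw + "}")


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes, ignores tags and attributes."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, payload):
        self.parts.append(payload)


extractor = TextExtractor()
extractor.feed(record["overview"])
extractor.close()

text = "".join(extractor.parts).strip()
print(text)  # WTS/VDI macOS
```

With BeautifulSoup installed, record["overview"] could be passed to BeautifulSoup(..., 'html.parser') and read through .text instead of using the helper class.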

Driftr95 · Accepted Answer · 2022-10-29 14:27:00Z

You can just encode the string as UTF-8 before passing it to BeautifulSoup (which accepts bytes as well as str) and then extract the text:

from bs4 import BeautifulSoup

# pasted your string excerpt to variable xstr:
xstr = '"overview":"\u003cp style=\"margin: 0px 0px 20px; padding: 0px; line-height: 20px; outline: none !important; min-height: 1em; color: #333333; font-family: Arial, Helvetica, sans-serif, emoji; font-size: 14px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-size-adjust: auto; text-align: center;\"\u003e\u003cspan style=\"font-family: arial, helvetica, sans-serif;\"\u003e\u003cstrong\u003e\u003cspan style=\"font-size: 18pt; outline: none !important;\"\u003eWTS/VDI macOS\u003c/span\u003e\u003c/strong\u003e\u003c/span\u003e\u003c/p\u003e\n\u003chr\u003e\n\u003cp\u003e\u003cspan ......'

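# Naming the parser explicitly avoids the "no parser was explicitly specified" warning
print(BeautifulSoup(xstr.encode('utf-8'), 'html.parser').text)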

output: "overview":"WTS/VDI macOS
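
A possible reason the tags were not recognized in the question, offered as a guess rather than a diagnosis: when the excerpt is pasted into a normal Python string literal, Python resolves \u003c and \u003e while parsing the source, so the string already contains real angle brackets. Text read from a file or a response body keeps the backslashes until it is unescaped. A small sketch of the difference, using a shortened sample string:

```python
# In a normal string literal, Python resolves the escapes while parsing
# the source code, so the tags are already real angle brackets.
literal = "\u003cp\u003eWTS/VDI macOS\u003c/p\u003e"

# A raw literal keeps the backslashes, which is what text read from a
# file or a network response looks like before it is unescaped.
raw = r"\u003cp\u003eWTS/VDI macOS\u003c/p\u003e"

print(literal)         # <p>WTS/VDI macOS</p>
print("<" in literal)  # True
print("<" in raw)      # False

# Unescaping the raw form yields the same text as the normal literal
# (safe here because the sample is pure ASCII).
print(raw.encode("utf-8").decode("unicode-escape") == literal)  # True
```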
