0

I have a string from which I would like to remove the HTML tags.

"overview":"\u003cp style=\"margin: 0px 0px 20px; padding: 0px; line-height: 20px; outline: none !important; min-height: 1em; color: #333333; font-family: Arial, Helvetica, sans-serif, emoji; font-size: 14px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-size-adjust: auto; text-align: center;\"\u003e\u003cspan style=\"font-family: arial, helvetica, sans-serif;\"\u003e\u003cstrong\u003e\u003cspan style=\"font-size: 18pt; outline: none !important;\"\u003eWTS/VDI macOS\u003c/span\u003e\u003c/strong\u003e\u003c/span\u003e\u003c/p\u003e\n\u003chr\u003e\n\u003cp\u003e\u003cspan ......

I would like to just have

"overview":"WTS/VDI macOS.....

I tried BeautifulSoup and Bleach, but they only recognize tags written with literal '<' and '>' characters. Is there a library or function that removes these for me? Or should I convert the Unicode escape sequences first and then strip the tags manually?

1
  • text-overflow + white-space ? Is this something you tried via CSS or does it need to be reduced from the code itself , not only at screen ? Commented Oct 29, 2022 at 9:43

3 Answers

1

Your string presumably comes from somewhere else, with '<' and '>' stored as the escape sequences \u003c and \u003e. Once those escapes are decoded, it becomes:

'"overview":"<p style="margin: 0px 0px 20px; padding: 0px; line-height: 20px; outline: none !important; min-height: 1em; color: #333333; font-family: Arial, Helvetica, sans-serif, emoji; font-size: 14px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-size-adjust: auto; text-align: center;"><span style="font-family: arial, helvetica, sans-serif;"><strong><span style="font-size: 18pt; outline: none !important;">WTS/VDI macOS</span></strong></span></p>\n<hr>\n<p><span...' which tools such as BeautifulSoup should handle.

Note that if your string is the captured output of subprocess.run, adding an encoding="utf-8" parameter gives you decoded text (str) rather than bytes.
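If the text is a fragment of a JSON document (an assumption; the `"overview":"..."` shape suggests it, but the question does not say so), the standard-library json module can decode the \u003c, \" and \n escapes in one step. A minimal sketch, using a shortened, hypothetical payload rather than the full string from the question:

```python
import json

# Hypothetical, shortened payload: a JSON object whose value holds escaped HTML.
# The raw-string prefix keeps the backslash escapes literal, as they would be
# in a file or an HTTP response body.
raw = r'{"overview":"\u003cp style=\"margin: 0px;\"\u003e\u003cstrong\u003eWTS/VDI macOS\u003c/strong\u003e\u003c/p\u003e\n\u003chr\u003e"}'

# json.loads decodes every JSON escape sequence while parsing.
data = json.loads(raw)
html = data["overview"]

print(html)
```

The resulting `html` value contains ordinary '<' and '>' characters, so it can be passed to an HTML parser as usual.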


1

This should do!

import os
from bs4 import BeautifulSoup

# Raw string, so the \u003c escapes stay literal until they are decoded below.
content = r'"overview":"\u003cp style=\"margin: 0px 0px 20px; padding: 0px; line-height: 20px; outline: none !important; min-height: 1em; color: #333333; font-family: Arial, Helvetica, sans-serif, emoji; font-size: 14px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-size-adjust: auto; text-align: center;\"\u003e\u003cspan style=\"font-family: arial, helvetica, sans-serif;\"\u003e\u003cstrong\u003e\u003cspan style=\"font-size: 18pt; outline: none !important;\"\u003eWTS/VDI macOS\u003c/span\u003e\u003c/strong\u003e\u003c/span\u003e\u003c/p\u003e\n\u003chr\u003e\n\u003cp\u003e\u003cspan ......'

# Decode the escapes, then parse the resulting HTML.
soup = BeautifulSoup(content.encode('utf-8').decode('unicode-escape'), 'html.parser')
for tag in soup():
    for attribute in ["class", "id", "name", "style"]:
        del tag[attribute]

# Drop empty lines from the extracted text.
new_s = os.linesep.join([s for s in soup.text.splitlines() if s])
print(new_s)

This will give me

"overview":"WTS/VDI (Citrix) - macOS WTS (Windows Terminal Services) and VDI (Virtual Desktop Instance) provide you....

without the HTML Tags!
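One caveat with the encode('utf-8').decode('unicode-escape') round trip: it treats every byte as Latin-1, so any real non-ASCII character already in the text gets mangled. The sketch below shows the effect on a made-up sample string (not from the question) and contrasts it with json.loads, which assumes the text is valid JSON string content with no unescaped double quotes:

```python
import json

# Made-up sample: a real non-ASCII character (e with acute accent) followed by
# literal \u003c-style escapes, kept literal by the raw-string prefix.
raw = 'caf\u00e9 ' + r'\u003cb\u003eok\u003c/b\u003e'

# The round trip decodes the escapes but corrupts the accented character,
# because its two UTF-8 bytes are read back as two Latin-1 characters.
naive = raw.encode('utf-8').decode('unicode-escape')

# Wrapping the text in quotes makes it a JSON string literal, which json.loads
# decodes without touching characters that are already unescaped.
safe = json.loads('"' + raw + '"')

print(repr(naive))
print(repr(safe))
```

For pure-ASCII input such as the excerpt in the question, both approaches give the same result.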


1

You can just encode the string as UTF-8 before passing it to BeautifulSoup (which accepts bytes as well as str) and then extract the text:

from bs4 import BeautifulSoup

# pasted your string excerpt to variable xstr:
xstr = '"overview":"\u003cp style=\"margin: 0px 0px 20px; padding: 0px; line-height: 20px; outline: none !important; min-height: 1em; color: #333333; font-family: Arial, Helvetica, sans-serif, emoji; font-size: 14px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-size-adjust: auto; text-align: center;\"\u003e\u003cspan style=\"font-family: arial, helvetica, sans-serif;\"\u003e\u003cstrong\u003e\u003cspan style=\"font-size: 18pt; outline: none !important;\"\u003eWTS/VDI macOS\u003c/span\u003e\u003c/strong\u003e\u003c/span\u003e\u003c/p\u003e\n\u003chr\u003e\n\u003cp\u003e\u003cspan ......'

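# Naming the parser explicitly avoids BeautifulSoup's "no parser specified" warning.
print(BeautifulSoup(xstr.encode('utf-8'), 'html.parser').text)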

output: "overview":"WTS/VDI macOS
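If installing a third-party package is not an option, the tags can also be stripped with the standard library alone. This is a rough sketch, not a replacement for BeautifulSoup: the names `TextExtractor` and `strip_tags` are made up for illustration, and the input is assumed to have its escapes already decoded.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; tags and attributes are ignored.
        self.parts.append(data)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    # Collapse runs of whitespace (including newlines) into single spaces.
    return " ".join("".join(parser.parts).split())


sample = '<p style="margin: 0px;"><strong>WTS/VDI macOS</strong></p>\n<hr>\n<p>Second</p>'
print(strip_tags(sample))
```

Because whitespace is collapsed, text from separate paragraphs ends up on one line separated by a space; keep the newlines instead if the paragraph structure matters.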

