33

I have a string in Python like this:

u'\u200cHealth & Fitness'

How can I remove the \u200c part from the string?

3
  • s.encode('utf-8') Commented Sep 11, 2017 at 11:26
  • @Vinny the return string is \xe2\x80\x8cHealth & Fitness Commented Sep 11, 2017 at 11:27
  • my bad, the encoding should be ascii as Arount answered below Commented Sep 11, 2017 at 11:48

6 Answers 6

57

You can encode it into ascii and ignore errors:

u'\u200cHealth & Fitness'.encode('ascii', 'ignore')

Output:

'Health & Fitness'
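The output above is what Python 2 prints. As a minimal sketch, assuming Python 3, the same call returns a bytes object, so a decode step is needed to get a str back (the variable names here are only illustrative):

```python
text = "\u200cHealth & Fitness"

# In Python 3, encode() returns bytes; characters that cannot be
# represented in ASCII are silently dropped because of 'ignore'.
as_bytes = text.encode("ascii", "ignore")

# Decode the ASCII bytes to get a text string again.
cleaned = as_bytes.decode("ascii")

print(as_bytes)  # b'Health & Fitness'
print(cleaned)   # Health & Fitness
```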

1 Comment

This works for the example above, but it forces the string into ASCII and drops every non-ASCII character, so it is not a solution that works for all strings.
34

If you have a string that contains a non-ASCII Unicode character, such as

s = "Airports Council International \u2013 North America"

then you can try:

newString = (s.encode('ascii', 'ignore')).decode("utf-8")

and the output will be (the en dash is dropped, leaving two adjacent spaces):

Airports Council International  North America

3 Comments

Shouldn't we decode with 'ascii' after encoding to ASCII?
If you have a list of strings, you can adapt this as a list comprehension: list_text_fixed = [(s.encode('ascii', 'ignore')).decode("utf-8") for s in list_text]
This is a bad solution: it removes ALL non-ASCII characters, not just the zero-width non-joiner.
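To illustrate the objection in the comments, here is a small sketch (Python 3 assumed; `ascii_only` and the second sample string are made up for the demonstration) showing that the encode/decode round trip also discards accented letters and punctuation, not only \u200c:

```python
def ascii_only(s):
    # Drop every character that has no ASCII representation.
    return s.encode("ascii", "ignore").decode("ascii")

samples = [
    "\u200cHealth & Fitness",      # leading zero-width non-joiner
    "Caf\u00e9 \u2013 Z\u00fcrich",  # e-acute, en dash, u-umlaut
]

results = [ascii_only(s) for s in samples]

for before, after in zip(samples, results):
    print(repr(before), "->", repr(after))
```

The second sample loses its accented letters and the dash, which is usually not what is wanted for text that is not plain English.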
31

I just use replace, since I don't need the character:

varstring.replace('\u200c', '')

Or in your case:

u'\u200cHealth & Fitness'.replace('\u200c', '')
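If several invisible characters need to go, the same idea can be extended with str.translate. This is only a sketch, assuming Python 3; the set of characters in `ZERO_WIDTH` and the helper name `strip_zero_width` are illustrative choices, not part of the answer above:

```python
# Illustrative set of invisible characters to drop; adjust to your data.
ZERO_WIDTH = {
    "\u200b",  # zero width space
    "\u200c",  # zero width non-joiner
    "\u200d",  # zero width joiner
    "\ufeff",  # zero width no-break space (byte order mark)
}

# str.translate deletes any code point that maps to None.
REMOVE_TABLE = {ord(ch): None for ch in ZERO_WIDTH}

def strip_zero_width(s):
    return s.translate(REMOVE_TABLE)

cleaned = strip_zero_width("\u200cHealth\u200b & Fitn\u00e9ss")
print(repr(cleaned))
```

Unlike the ASCII round trip, this leaves other non-ASCII characters (the accented letter in the sample) untouched.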

3 Comments

This is actually better than the accepted answer for most strings. \u200c is a zero-width non-joiner, an invisible format character that is not classed as whitespace, so strip() with no arguments leaves it in place. In most cases with Unicode strings you do not want encode('ascii', 'ignore').
This is the more general solution, since encoding to ASCII may remove other Unicode characters as well.
appreciate this!
5

For me, the following worked:

mystring.encode('ascii', 'ignore').decode('unicode_escape')

3 Comments

You could improve your answer by explaining why this code works, and what you're doing here. That way, others can be educated.
tbh, that was a 'Frankenstein' version of all answers that I had previously found but which didn't work. I can't really explain why this one worked over the rest of solutions in my case..
This is a bad solution: it removes ALL non-ASCII characters, not just the zero-width non-joiner.
2

In the specific case in the question, where the string is prefixed with a single u'\u200c' character, the solution is as simple as taking a slice that does not include the first character.

original = u'\u200cHealth & Fitness'
fixed = original[1:]

If the leading character may or may not be present, str.lstrip can be used (it removes every leading occurrence, not just one):

original = u'\u200cHealth & Fitness'
fixed = original.lstrip(u'\u200c')

The same solutions work in Python 3. From Python 3.9, str.removeprefix is also available:

original = u'\u200cHealth & Fitness'
fixed = original.removeprefix(u'\u200c')
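The three approaches differ when the character is repeated or missing. The sketch below assumes Python 3; `remove_prefix` is a hypothetical stand-in that mimics str.removeprefix so the example also runs on versions older than 3.9:

```python
ZWNJ = "\u200c"

def remove_prefix(s, prefix):
    # Remove the prefix at most once, and only if it is present.
    if prefix and s.startswith(prefix):
        return s[len(prefix):]
    return s

doubled = ZWNJ + ZWNJ + "Health"

sliced = doubled[1:]                       # always drops exactly one character
stripped = doubled.lstrip(ZWNJ)            # drops every leading ZWNJ
prefixed = remove_prefix(doubled, ZWNJ)    # drops a single leading ZWNJ

# Slicing is unsafe when the character is absent: it eats a real letter.
bad_slice = "Health"[1:]
safe = remove_prefix("Health", ZWNJ)

print(repr(sliced), repr(stripped), repr(prefixed))
print(repr(bad_slice), repr(safe))
```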

Comments

0

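If the text is only English, use:

u'\u200cHealth & Fitness'.encode('ascii', 'ignore')

But if it contains other scripts, such as Arabic or Persian, use:

# This targets the literal text \u200c (a backslash followed by u200c),
# not the actual zero-width non-joiner character.
s = s.replace('\\', '').replace('u200c', '')

If you are going to write a text file:

import codecs

# array_string is the list of lines to write
with codecs.open('text_file.txt', 'w', encoding='utf-8') as text_file:
    for line in array_string:
        text_file.write('\u200c' + line + '\n')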

3 Comments

Wouldn't it be wrong to remove it from Persian where the orthography requires it?
@Andj, I checked, it kept the structure well: ‌ﺍﻭﻟﻮﯾﺖ\u200cﻫﺎﯼ ﭼﺎﭖ to ﺍﻭﻟﻮﯾﺖﻫﺎﯼ ﭼﺎﭖ , as you can see this half-space is keeping!
except your comment is using presentation forms
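As the first comment points out, the zero-width non-joiner is meaningful inside Persian words, so a blanket replace can alter the text. One possible compromise, sketched here for Python 3 with an illustrative Persian word, is to trim the character only at the ends of the string and leave word-internal occurrences alone:

```python
ZWNJ = "\u200c"

# Illustrative Persian word that contains a ZWNJ between its two parts.
word = "\u0645\u06cc" + ZWNJ + "\u062e\u0648\u0627\u0647\u0645"

# Stray ZWNJ characters at the start and end, as in the question.
raw = ZWNJ + word + ZWNJ

# strip() with an explicit argument removes only leading/trailing ZWNJ.
trimmed = raw.strip(ZWNJ)

# A blanket replace would also delete the ZWNJ inside the word.
flattened = raw.replace(ZWNJ, "")

print(ZWNJ in trimmed)    # True: the internal ZWNJ is preserved
print(ZWNJ in flattened)  # False: the internal ZWNJ is gone
```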
