Remove zero width space unicode character from Python string

Question

I have a string in Python like this:

u'\u200cHealth & Fitness'

How can I remove the

\u200c

part from the string?

My bad, the encoding should be ASCII, as Arount answered below.
– Chen A., Commented Sep 11, 2017 at 11:48

Arount · Accepted Answer · 2017-09-11 11:29:15Z

57

You can encode it into ascii and ignore errors:

u'\u200cHealth & Fitness'.encode('ascii', 'ignore')

Output:

'Health & Fitness'
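A note for Python 3 readers, offered as a minimal sketch rather than part of the original answer: the output above is what Python 2 prints. In Python 3, encode() returns a bytes object, so a decode step is needed to get a str back. The variable names below are illustrative only.

```python
text = u'\u200cHealth & Fitness'

# In Python 3, encode() yields bytes; characters outside ASCII are silently dropped.
ascii_bytes = text.encode('ascii', 'ignore')

# Decode back to str to keep working with text.
cleaned = ascii_bytes.decode('ascii')

print(repr(ascii_bytes))
print(repr(cleaned))
```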

answered Sep 11, 2017 at 11:29

Arount


1 Comment

Martin Massera Over a year ago

This works for the example above, but it forces the string into ASCII and drops every non-ASCII character, so it is not a solution that works in general.

Martin Massera · Accepted Answer · 2024-10-15 06:38:19Z

34

If you have a string that contains a Unicode character, like

s = "Airports Council International \u2013 North America"

then you can try:

newString = (s.encode('ascii', 'ignore')).decode("utf-8")

and the output will be:

Airports Council International  North America
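To see what this round trip does beyond the one example, here is a small sketch. The helper name ascii_only and the second sample string are made up for illustration. Note that the en dash is dropped while the spaces on both sides of it remain, and that accented letters disappear as well.

```python
def ascii_only(s):
    """Drop every character outside ASCII (hypothetical helper name)."""
    return s.encode('ascii', 'ignore').decode('ascii')

# The en dash (U+2013) is removed; the two surrounding spaces stay.
dash = ascii_only("Airports Council International \u2013 North America")

# The zero width non-joiner goes away, but so do the accented letters.
accented = ascii_only("Caf\u00e9 \u200cM\u00fcnchen")

print(repr(dash))
print(repr(accented))
```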

edited Oct 15, 2024 at 6:38

Martin Massera


answered Feb 21, 2018 at 7:47

Hayat


3 Comments

Vaibhav Vishal Over a year ago

Shouldn't we decode with 'ascii' after encoding to ASCII?

timothyjgraham Over a year ago

If you have a list of strings, you can adapt this as a list comprehension: list_text_fixed = [(s.encode('ascii', 'ignore')).decode("utf-8") for s in list_text]

Martin Massera Over a year ago

This is a bad solution, it will remove ALL unicode characters, not just zero width space.

joanis · Accepted Answer · 2019-07-28 14:19:50Z

31

I just use replace, because I don't need the character:

varstring.replace('\u200c', '')

Or in your case:

u'\u200cHealth & Fitness'.replace('\u200c', '')
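If several invisible characters may show up, the same idea can be extended with str.translate. This is a sketch under assumptions: the set of characters below (zero width space, zero width non-joiner, zero width joiner, and the byte order mark) is my own pick, and the names ZERO_WIDTH_CHARS and strip_zero_width are hypothetical. In some scripts the joiner characters carry meaning, so the set should be adjusted to the data at hand.

```python
# Assumed set of invisible characters; adjust to your data.
ZERO_WIDTH_CHARS = '\u200b\u200c\u200d\ufeff'

# Mapping a code point to None tells str.translate to delete it.
_DELETE_TABLE = dict.fromkeys(map(ord, ZERO_WIDTH_CHARS))

def strip_zero_width(text):
    """Remove only the listed characters; everything else is kept."""
    return text.translate(_DELETE_TABLE)

cleaned = strip_zero_width('\u200cHealth\u200b & Fitness\ufeff')
other = strip_zero_width('Caf\u00e9')

print(repr(cleaned))
print(repr(other))
```

Unlike the ASCII round trip, this leaves accented letters and other non-ASCII text untouched.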

edited Jul 28, 2019 at 14:19

joanis


answered Mar 28, 2019 at 15:06

Sitti Munirah Abdul Razak


3 Comments

Chet Over a year ago

This is actually better than the accepted answer for most strings. The \u200c is a zero width non-joiner, which is an unusual whitespace-like character that strip() ignores. In most cases with unicode strs you do not want to encode('ascii', 'ignore').

prosti Over a year ago

This is the more general solution, since encoding to ASCII may remove other Unicode characters as well.

user3768258 Over a year ago

appreciate this!
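Chet's point about strip() can be checked with the standard unicodedata module. The sketch below only inspects the character; the variable names are illustrative.

```python
import unicodedata

zwnj = '\u200c'

name = unicodedata.name(zwnj)          # official Unicode character name
category = unicodedata.category(zwnj)  # general category (a format character)
is_space = zwnj.isspace()              # Python does not treat it as whitespace

# Because it is not whitespace, a bare strip() leaves it in place.
stripped = (zwnj + 'Health & Fitness').strip()

print(name, category, is_space, repr(stripped))
```

So removing it requires naming the character explicitly, as replace('\u200c', '') does.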

chujudzvin · Accepted Answer · 2018-12-11 10:41:44Z

5

for me the following worked

mystring.encode('ascii', 'ignore').decode('unicode_escape')
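One possible explanation, offered as a guess and not as what happened in the answerer's case: the two steps do different jobs. The encode step drops real non-ASCII characters, while decoding with unicode_escape turns backslash escapes that are present as literal text back into characters. The sketch below contrasts the two inputs; the variable names are made up.

```python
# Contains the real U+200C character.
actual = '\u200cHealth & Fitness'

# Contains the six literal characters: backslash, u, 2, 0, 0, c.
literal = '\\u200cHealth & Fitness'

from_actual = actual.encode('ascii', 'ignore').decode('unicode_escape')
from_literal = literal.encode('ascii', 'ignore').decode('unicode_escape')

print(repr(from_actual))
print(repr(from_literal))
```

With the literal escape text as input, the character is not removed at all; it is decoded into a real U+200C.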

answered Dec 11, 2018 at 10:41

chujudzvin


3 Comments

RyanZim Over a year ago

You could improve your answer by explaining why this code works, and what you're doing here. That way, others can be educated.

chujudzvin Over a year ago

tbh, that was a 'Frankenstein' version of all answers that I had previously found but which didn't work. I can't really explain why this one worked over the rest of solutions in my case..

Martin Massera Over a year ago

This is a bad solution, it will remove ALL unicode characters, not just zero width space.

snakecharmerb · Accepted Answer · 2021-01-12 17:50:04Z

2

In the specific case in the question, where the string is prefixed with a single u'\u200c' character, the solution is as simple as taking a slice that does not include the first character.

original = u'\u200cHealth & Fitness'
fixed = original[1:]

If the leading character may or may not be present, str.lstrip may be used:

original = u'\u200cHealth & Fitness'
fixed = original.lstrip(u'\u200c')

The same solutions will work in Python 3. From Python 3.9, str.removeprefix is also available:

original = u'\u200cHealth & Fitness'
fixed = original.removeprefix(u'\u200c')
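The three approaches differ on edge cases, which the following sketch tries to show. The remove_prefix helper is a hypothetical stand-in that mirrors str.removeprefix so the example also runs on Python versions before 3.9; the sample strings are made up.

```python
def remove_prefix(text, prefix):
    """Remove prefix once if present; mirrors str.removeprefix (Python 3.9+)."""
    if text.startswith(prefix):
        return text[len(prefix):]
    return text

doubled = u'\u200c\u200cHealth & Fitness'  # two leading characters
plain = u'Health & Fitness'                # no leading character

# Slicing always drops the first character, wanted or not.
sliced = plain[1:]

# lstrip removes every leading occurrence.
lstripped = doubled.lstrip(u'\u200c')

# Prefix removal takes off exactly one occurrence.
prefix_removed = remove_prefix(doubled, u'\u200c')

print(repr(sliced), repr(lstripped), repr(prefix_removed))
```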

answered Jan 12, 2021 at 17:50

snakecharmerb


Mori · Accepted Answer · 2024-06-17 09:07:01Z

0

If the text is only English, do it this way:

u'\u200cHealth & Fitness'.encode('ascii', 'ignore')

But if the text is in a language such as Arabic or Persian, do it this way:

s = s.replace('\\', '').replace('u200c', '')

If you're going to write a text file:

import codecs

# array_string is the list of lines to write
with codecs.open('text_file.txt', 'w', encoding='utf-8') as text_file:
    for line in array_string:
        text_file.write('\u200c' + line + '\n')
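For comparison, here is a Python 3 sketch that uses the built-in open() and strips the character before writing, unlike the snippet above, which prepends it to each line. It assumes the goal is a file without the character; the sample lines and the temporary path are made up.

```python
import os
import tempfile

# Made-up sample data containing the zero width non-joiner.
lines = ['\u200cHealth & Fitness', 'Food\u200c & Drink']

# Write to a throwaway location so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), 'text_file.txt')

with open(path, 'w', encoding='utf-8') as text_file:
    for line in lines:
        # Remove only U+200C; all other characters are written unchanged.
        text_file.write(line.replace('\u200c', '') + '\n')

with open(path, encoding='utf-8') as text_file:
    written = text_file.read()

print(repr(written))
```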

answered Jun 17, 2024 at 9:07

Mori


3 Comments

Andj Over a year ago

Wouldn't it be wrong to remove it from Persian where the orthography requires it?

Mori Over a year ago

@Andj, I checked, it kept the structure well: ‌ﺍﻭﻟﻮﯾﺖ\u200cﻫﺎﯼ ﭼﺎﭖ to ﺍﻭﻟﻮﯾﺖﻫﺎﯼ ﭼﺎﭖ , as you can see this half-space is keeping!

Andj Over a year ago

except your comment is using presentation forms
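Andj's concern can be illustrated with a short sketch. It assumes the unwanted characters sit only at the start or end of the string, so str.strip with an explicit argument removes those while keeping the one inside the word. The Persian word is written with escapes for its ordinary (non-presentation-form) letters, and the variable names are hypothetical.

```python
# A Persian word with an orthographic zero width non-joiner inside it.
word = '\u0627\u0648\u0644\u0648\u06cc\u062a\u200c\u0647\u0627\u06cc'

# Same word with stray copies of the character at both ends.
raw = '\u200c' + word + '\u200c'

# strip() with an explicit argument only touches the ends.
edges_only = raw.strip('\u200c')

# replace() removes the character everywhere, including inside the word.
everywhere = raw.replace('\u200c', '')

print(edges_only == word, '\u200c' in everywhere)
```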
