How to detect if a String has specific UTF-8 characters in it? (Python)

Question

I have a list of strings in Python. I want to remove from the list every string that contains special UTF-8 characters, keeping only the strings made up entirely of characters from "U+0021" to "U+00FF". So, do you know a way to detect whether a string contains only these characters?

Thanks :)

EDIT: I use Python 3

Characters above U+00FF aren't "special"; you merely don't want them, which is entirely arbitrary.
– deceze ♦, Commented Jul 11, 2016 at 13:27
@deceze even if they are not special, I don't want them, right :)
– tommitomtom, Commented Jul 11, 2016 at 13:28

frnhr · Accepted Answer · 2016-07-11 13:42:54Z

3

>>> all_strings = ["okstring", "bađštring", "goodstring"]
>>> acceptible = set(chr(i) for i in range(0x21, 0xFF + 1))
>>> simple_strings = filter(lambda s: set(s).issubset(acceptible), all_strings)
>>> list(simple_strings)
['okstring', 'goodstring']

edited Jul 11, 2016 at 13:42

answered Jul 11, 2016 at 13:37

frnhr


1 Comment

Grant Curell Over a year ago

I think this is the better answer because the question title is "How to detect if a String has specific UTF-8 characters in it? (Python)" and this answer generically allows you to check for any set of unicode characters whereas the accepted answer only checks for non-ASCII style characters.

Blablablabli · Accepted Answer · 2016-07-11 13:40:33Z

1

What do you mean exactly by "special utf-8 characters" ?

If you mean every non-ascii character, then you can try:

s.encode('ascii', 'strict')

It will raise a UnicodeEncodeError if the string is not 100% ASCII.
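
A minimal sketch of that check wrapped in a helper, assuming Python 3. The name is_ascii is hypothetical and not part of the answer. Note that ASCII covers only U+0000 to U+007F, which is narrower than the U+0021 to U+00FF range the question asks about, so this rejects characters such as é that the asker wants to keep.

```python
def is_ascii(s):
    """Return True if every character of s is in the ASCII range U+0000..U+007F."""
    try:
        # str.encode raises UnicodeEncodeError on the first non-ASCII character
        s.encode('ascii', 'strict')
    except UnicodeEncodeError:
        return False
    return True

print(is_ascii("okstring"))    # True
print(is_ascii("bađštring"))   # False
```

On Python 3.7 and later, s.isascii() gives the same answer without the exception handling.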

edited Jul 11, 2016 at 13:40

answered Jul 11, 2016 at 13:34

Blablablabli


2 Comments

tommitomtom Over a year ago

@BlaBlaBlabli I mean more "specific" than special. The specific characters are every character which is not "U+0021" to "U+00FF". So if I find a character outside of that range in my string, I want to do something with the string, like deleting it from a list.

Blablablabli Over a year ago

@SergeBallesta oops, I meant encode. I've edited my answer.

Serge Ballesta · Accepted Answer · 2016-07-11 13:49:15Z

0

The latin1 encoding corresponds to the first 256 Unicode code points. Put differently, if c is a Unicode character with a code in [0-255], c.encode('latin1') is a single byte whose value equals ord(c).

So to test whether a string has at least one character outside the [0-255] range, just try to encode it as latin1. If it contains none, the encoding will succeed, else you will get a UnicodeEncodeError:

no_special = True
try:
    s.encode('latin1')
except UnicodeEncodeError:
    no_special = False

BTW, as you were told in the comments, Unicode characters outside the [0-255] range are not special; they are simply not in the latin1 range.

Please note that the above also accepts all control characters like \t, \r or \n because they are legal latin1 characters. It may or may not be what you want here.
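
If the control characters and the space (everything below U+0021) should be rejected as well, one option is to compare code points directly instead of encoding. This is a sketch under that assumption, not part of the original answer; the helper name in_range and its low/high parameters are hypothetical.

```python
def in_range(s, low=0x21, high=0xFF):
    """Return True if every character of s has a code point in [low, high]."""
    return all(low <= ord(c) <= high for c in s)

strings = ["okstring", "bađštring", "tab\there", "café"]
print([s for s in strings if in_range(s)])   # ['okstring', 'café']
```

An empty string passes this check, because all() over an empty sequence is True. Add an explicit length test if empty strings should be dropped too.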

answered Jul 11, 2016 at 13:49

Serge Ballesta


1 Comment

Serge Ballesta Over a year ago

This answer is obviously only valid for Python 3. It makes sense for Python 2 only if s is a unicode string.

ltux · Accepted Answer · 2016-07-11 14:04:30Z

0

You can use a regular expression.

import re
mylist = ['str1', 'štr2', 'str3']
regexp = re.compile(r'[^\u0021-\u00FF]')
good_strs = filter(lambda s: not regexp.search(s), mylist)

[^\u0021-\u00FF] defines a character set matching any one character not in the range from \u0021 to \u00FF. The letter r before '[^\u0021-\u00FF]' indicates raw string notation, which saves you a lot of backslash ('\') escaping. Without it, every backslash in a regular expression would have to be prefixed with another one to escape it.

regexp.search(s) will scan through s looking for the first location where the regular expression r'[^\u0021-\u00FF]' produces a match, and return a corresponding match object. It returns None if no match is found.

filter() keeps only the strings in which no such character is found. In Python 3 it returns a lazy iterator, so wrap it in list() if you need a list.

This answer is only valid for Python 3
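
An equivalent way to phrase the same test is to match the whole string against the allowed range, rather than searching for a disallowed character. The following is a sketch assuming Python 3.4 or later (for fullmatch); the variable name allowed is hypothetical.

```python
import re

# Matches only strings made up entirely of characters in U+0021..U+00FF.
allowed = re.compile(r'[\u0021-\u00FF]*')

mylist = ['str1', 'štr2', 'str3']
# A list comprehension yields a list directly instead of a lazy filter object.
good_strs = [s for s in mylist if allowed.fullmatch(s)]
print(good_strs)   # ['str1', 'str3']
```

Both forms accept the empty string: the negated search finds nothing to reject, and the * quantifier matches zero characters.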

edited Jul 11, 2016 at 14:04

answered Jul 11, 2016 at 13:29

ltux


5 Comments

tommitomtom Over a year ago

I don't want to replace the characters in the string. I just want to know if there is any character outside of my utf-8 charset. Do you have an example like if (re.find(r'[\u0021-\u00ff]',s))? (I don't know, I am really unfamiliar with all of it.)

syntonym Over a year ago

@tommitomtom Try re.search.

tommitomtom Over a year ago

@syntonym now it matches like everything. I tried it with some chars in an if decision. I get a match with like every character

ltux Over a year ago

@tommitomtom I misunderstood your question. Will edit it again.

tommitomtom Over a year ago

@ltux I already got a solution. Thanks very much for all your help :)

VinjaNinja · Accepted Answer · 2023-03-23 17:38:06Z

0

The below code snippet worked for me (using Regex in python3):

import re
# Build a character class of U+00A1..U+00FF and strip those characters from inputString
nonAcceptibleUTF8Chars = list(chr(i) for i in range(161, 255 + 1))
result = re.sub('[' + re.escape(''.join(nonAcceptibleUTF8Chars)) + ']', '', inputString)

inputString = VICTORIAÏ¿½S SECRET

result = VICTORIAS SECRET

Though late to the party, Hope this helps! :)

answered Mar 23, 2023 at 17:38

VinjaNinja
