1

I have a list of strings in Python. I want to remove from the list every string that contains characters outside a specific range, keeping only the strings made up entirely of characters from "U+0021" to "U+00FF". Is there a way to detect whether a string contains only these characters?

Thanks :)

EDIT: I use Python 3

4
  • Which Python, 2 or 3? Commented Jul 11, 2016 at 13:20
  • Characters above U+00FF aren't "special"; you merely don't want them, which is entirely arbitrary. Commented Jul 11, 2016 at 13:27
  • @deceze Even if they are not special, I don't want them, right? :) Commented Jul 11, 2016 at 13:28
  • @frnhr I use Python 3 :) Commented Jul 11, 2016 at 13:29

5 Answers

3
>>> all_strings = ["okstring", "bađštring", "goodstring"]
>>> acceptible = set(chr(i) for i in range(0x21, 0xFF + 1))
>>> simple_strings = filter(lambda s: set(s).issubset(acceptible), all_strings)
>>> list(simple_strings)
['okstring', 'goodstring']
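
The same subset test can be wrapped in a small helper so that the allowed set becomes a parameter. This is only a sketch: the names `uses_only`, `hex_digits` and `latin1_printable` are made up for illustration and do not come from the answer above.

```python
import string


def uses_only(s, allowed):
    """Return True if every character of s is a member of allowed."""
    return set(s).issubset(allowed)


# Two example allowed sets: hexadecimal digits, and the U+0021..U+00FF range
# from the question.
hex_digits = frozenset(string.hexdigits)
latin1_printable = frozenset(chr(i) for i in range(0x21, 0xFF + 1))

print(uses_only("DEADbeef", hex_digits))         # True
print(uses_only("bađštring", latin1_printable))  # False
```

Building the set once and reusing it avoids recreating it for every string in the list.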

1 Comment

I think this is the better answer because the question title is "How to detect if a String has specific UTF-8 characters in it? (Python)" and this answer generically allows you to check for any set of unicode characters whereas the accepted answer only checks for non-ASCII style characters.
1

What exactly do you mean by "special utf-8 characters"?

If you mean every non-ASCII character, then you can try:

s.encode('ascii', 'strict')

It will raise a UnicodeEncodeError if the string is not 100% ASCII.
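
As a minimal sketch of how that call could be used for filtering in Python 3, assuming a helper name `is_ascii_only` that is not part of the original answer:

```python
def is_ascii_only(s):
    """Return True if every character of s is in the ASCII range U+0000..U+007F."""
    try:
        s.encode('ascii', 'strict')
    except UnicodeEncodeError:
        return False
    return True


strings = ["okstring", "bađštring", "goodstring"]
ascii_strings = [s for s in strings if is_ascii_only(s)]
print(ascii_strings)  # ['okstring', 'goodstring']
```

Note that ASCII stops at U+007F, so this check is stricter than the U+0021 to U+00FF range asked about: a string such as "café" would be rejected.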

2 Comments

@BlaBlaBlabli I mean "specific" rather than special. The characters I want to catch are all those outside "U+0021" to "U+00FF". So if I find a character outside that range in a string, I want to do something with the string, like deleting it from a list.
@SergeBallesta Oops, I meant encode; I've edited my answer.
0

The latin1 encoding corresponds to the first 256 Unicode code points. Put differently, if c is a Unicode character with a code point in [0-255], c.encode('latin1') is a single byte whose value equals ord(c).
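
That correspondence can be checked directly in Python 3. The short sketch below (the variable names are illustrative) looks for any code point in 0..255 whose latin1 encoding is not the single byte of the same value, then tries the first code point past the range:

```python
# Code points in 0..255 whose latin1 encoding differs from the byte of the same value.
mismatches = [
    code for code in range(256)
    if chr(code).encode('latin1') != bytes([code])
]
print(mismatches)  # []

# The first code point past the range cannot be encoded as latin1.
try:
    chr(256).encode('latin1')
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)  # False
```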

So to test whether a string has at least one character outside the [0-255] range, just try to encode it as latin1. If it contains none, the encoding will succeed; otherwise you will get a UnicodeEncodeError:

no_special = True
try:
    s.encode('latin1')
except UnicodeEncodeError:
    no_special = False

BTW, as you were told in the comments, Unicode characters outside the [0-255] range are not special; they are simply not in the latin1 range.

Please note that the above also accepts all control characters like \t, \r or \n because they are legal latin1 characters. It may or may not be what you want here.
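
If the lower bound from the question (U+0021) matters, one possible variation is to inspect the encoded bytes as well. This is a sketch under that assumption, and the function name `only_0021_to_00ff` is hypothetical:

```python
def only_0021_to_00ff(s):
    """Return True if every character of s lies in U+0021..U+00FF."""
    try:
        encoded = s.encode('latin1')  # fails for anything above U+00FF
    except UnicodeEncodeError:
        return False
    # Each latin1 byte equals the code point, so reject bytes below 0x21
    # (space, tab, newline and the other C0 control characters).
    return all(b >= 0x21 for b in encoded)


print(only_0021_to_00ff("okstring"))   # True
print(only_0021_to_00ff("ok string"))  # False, the space is U+0020
print(only_0021_to_00ff("café"))       # True
print(only_0021_to_00ff("bađštring"))  # False
```

The C1 control characters U+0080 to U+009F are still accepted here, since they fall inside the requested range.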

1 Comment

This answer is obviously only valid for Python 3. It makes sense for Python 2 only if s is a unicode string.
0

You can use a regular expression.

import re
mylist = ['str1', 'štr2', 'str3']
regexp = re.compile(r'[^\u0021-\u00FF]')
good_strs = filter(lambda s: not regexp.search(s), mylist)

[^\u0021-\u00FF] defines a character set matching any single character not in the range from \u0021 to \u00FF. The letter r before '[^\u0021-\u00FF]' indicates raw string notation, which saves you a lot of backslash ('\') escaping. Without it, every backslash in a regular expression would have to be prefixed with another one to escape it.

regexp.search(s) will scan through s looking for the first location where the regular expression r'[^\u0021-\u00FF]' produces a match, and return a corresponding match object. It returns None if no match is found.

filter() will filter out the unwanted strings. In Python 3 it returns a lazy iterator, so wrap it in list() if you need a list.

This answer is only valid for Python 3.
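
To make the match-object-or-None behaviour concrete, here is a small sketch using the same pattern. The helper name `has_disallowed_char` and the extra list entry 'two words' are additions for illustration only:

```python
import re

pattern = re.compile(r'[^\u0021-\u00FF]')


def has_disallowed_char(s):
    """Return True if s contains any character outside U+0021..U+00FF."""
    return pattern.search(s) is not None


mylist = ['str1', 'štr2', 'str3', 'two words']
good_strs = [s for s in mylist if not has_disallowed_char(s)]
print(good_strs)  # ['str1', 'str3']

# search() returns a match object describing the first offending character.
match = pattern.search('štr2')
print(match.group(), match.start())  # š 0
```

'two words' is dropped because the space character is U+0020, just below the start of the range.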

5 Comments

I don't want to replace the characters in the string. I just want to know whether there is any character outside of my utf-8 charset. Do you have an example, like if (re.find(r'[\u0021-\u00ff]',s))? (I don't know, I am really unfamiliar with all of this.)
@tommitomtom Try re.search.
@syntonym Now it matches nearly everything. I tried it with some chars in an if statement, and I get a match with almost every character.
@tommitomtom I misunderstood your question. Will edit it again.
@ltux I already got a solution. Thanks very much for all your help :)
0

The below code snippet worked for me (using Regex in python3):

import re

inputString = 'VICTORIAÏ¿½S SECRET'
# Characters U+00A1..U+00FF, which this snippet strips from the input.
nonAcceptibleUTF8Chars = [chr(i) for i in range(161, 255 + 1)]
result = re.sub('[' + re.escape(''.join(nonAcceptibleUTF8Chars)) + ']', '', inputString)
print(result)  # VICTORIAS SECRET

Though late to the party, I hope this helps! :)
