269

I want to check whether a string is in ASCII or not.

I am aware of ord(), however when I try ord('é'), I get TypeError: ord() expected a character, but string of length 2 found. I understand it is caused by the way I built Python (as explained in ord()'s documentation).

Is there another way to check?

2
  • 2
    String encoding differs quite a bit between Python 2 and Python 3, so it would be good to know which version you're targeting. Commented Jul 13, 2017 at 9:46
  • @florisla Based on the error from ord('é'), OP's using Python 2. Commented Jul 12, 2020 at 18:16

15 Answers

290

I think you are not asking the right question--

A string in python has no property corresponding to 'ascii', utf-8, or any other encoding. The source of your string (whether you read it from a file, input from a keyboard, etc.) may have encoded a unicode string in ascii to produce your string, but that's where you need to go for an answer.

Perhaps the question you can ask is: "Is this string the result of encoding a unicode string in ascii?" -- This you can answer by trying:

try:
    mystring.decode('ascii')
except UnicodeDecodeError:
    print("it was not an ascii-encoded unicode string")
else:
    print("It may have been an ascii-encoded unicode string")

7 Comments

Using encode is better, because str has no decode method in Python 3; see "What's the difference between encode/decode? (Python 2.x)".
@Sri: That is because you are using it on an unencoded string (str in Python 2, bytes in Python 3).
In Python 2, this solution only works for a unicode string. A str in any ISO encoding would need to be encoded to Unicode first. The answer should go into this.
@JetGuo: you should use both depending on the input type: s.decode('ascii') if isinstance(s, bytes) else s.encode('ascii') in Python 3. OP's input is a bytestring 'é' (Python 2 syntax, Python 3 hadn't been released at the time) and therefore .decode() is correct.
@alexis: wrong. str on Python 2 is a bytestring. It is correct to use .decode('ascii') to find out whether all bytes are in the ascii range.
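The type-dependent check described in the comments above could be sketched for Python 3 roughly as follows. The name is_ascii_any is a hypothetical helper chosen for illustration, not something from the answer.

```python
def is_ascii_any(s):
    """Hypothetical helper: True if s (bytes, bytearray or str) is pure ASCII."""
    try:
        if isinstance(s, (bytes, bytearray)):
            s.decode('ascii')   # fails on any byte >= 0x80
        else:
            s.encode('ascii')   # fails on any code point >= U+0080
    except UnicodeError:        # base class of UnicodeDecodeError and UnicodeEncodeError
        return False
    return True


print(is_ascii_any(b'abc'))       # True
print(is_ascii_any(b'\xc3\xa9'))  # False (UTF-8 bytes of e-acute)
print(is_ascii_any('abc'))        # True
print(is_ascii_any('\u00e9'))     # False
```

Catching UnicodeError covers both directions, so the caller does not need to know which branch ran.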
236
def is_ascii(s):
    return all(ord(c) < 128 for c in s)
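One caveat if this is run on Python 3: iterating over a bytes object yields integers, so ord() would raise a TypeError there. A minimal sketch that handles both types, using the hypothetical name is_ascii_py3, might look like this:

```python
def is_ascii_py3(s):
    """Hypothetical Python 3 variant of the check above."""
    if isinstance(s, (bytes, bytearray)):
        # Iterating bytes gives ints in Python 3, so compare them directly.
        return all(b < 128 for b in s)
    # Iterating str gives 1-character strings, so ord() is needed.
    return all(ord(c) < 128 for c in s)


print(is_ascii_py3('abc'))        # True
print(is_ascii_py3('caf\u00e9'))  # False
print(is_ascii_py3(b'abc'))       # True
print(is_ascii_py3(b'\xc3\xa9'))  # False
```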

14 Comments

Pointlessly inefficient. Much better to try s.decode('ascii') and catch UnicodeDecodeError, as suggested by Vincent Marchetti.
It's not inefficient. all() will short-circuit and return False as soon as it encounters an invalid byte.
Inefficient or not, the more pythonic method is the try/except.
It is inefficient compared to the try/except. Here the loop is in the interpreter. With the try/except form, the loop is in the C codec implementation called by str.decode('ascii'). And I agree, the try/except form is more pythonic too.
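For anyone who wants to measure the difference discussed in these comments rather than argue about it, here is a rough Python 3 sketch using timeit. The function names and the sample size are assumptions made for illustration, and the numbers will vary by machine and Python version.

```python
import timeit


def is_ascii_loop(s):
    # Loop runs in the interpreter, one ord() call per character.
    return all(ord(c) < 128 for c in s)


def is_ascii_codec(s):
    # Loop runs inside the codec implementation.
    try:
        s.encode('ascii')
    except UnicodeEncodeError:
        return False
    return True


sample = 'a' * 100000  # all ASCII, so neither version can exit early

loop_time = timeit.timeit(lambda: is_ascii_loop(sample), number=10)
codec_time = timeit.timeit(lambda: is_ascii_codec(sample), number=10)
print('loop:  %.4f s' % loop_time)
print('codec: %.4f s' % codec_time)
```

An all-ASCII input is used because it is the case where short-circuiting in all() cannot help.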
@JohnMachin ord(c) < 128 is infinitely more readable and intuitive than c <= "\x7F"
192

New in Python 3.7 (bpo32677)

No more tiresome/inefficient ASCII checks on strings: the new built-in str/bytes/bytearray method .isascii() checks whether the string is ASCII.

print("is this ascii?".isascii())
# True

6 Comments

"\x03".isascii() is also True. The documentation says this just checks that all characters are below code point 128 (0-127). If you also want to avoid control characters, you will need: text.isascii() and text.isprintable(). Just using isprintable by itself is also not enough, as it will consider a character like ¿ to be (correctly) printable, but it's not within the ascii printable section, so you need to check both if you want both. Yet another gotcha: spaces are considered printable, tabs and newlines are not.
@Luc Good to know, but ASCII includes control chars. Avoiding them is another topic.
@wjandrea Yeah, obviously, but because 0x03 fits in 7 bits doesn't mean that it's what most people will want to be checking for when they find this page in their search results.
@Luc Yes, exactly. If someone thinks that all ASCII characters are safe to print, they're mistaken, but that's a valid topic and could deserve its own question.
It's unfortunate that there isn't some way to make this answer jump to the top other than wait for upvotes. If the OP would log on again, they could at least accept it, but it seems they haven't been seen at all since posting this question.
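The combination suggested in the first comment above can be tried out with a short sketch (Python 3.7+). The helper name is_printable_ascii is hypothetical.

```python
def is_printable_ascii(text):
    """Hypothetical helper: ASCII only, and no control characters."""
    return text.isascii() and text.isprintable()


print("\x03".isascii())       # True: control characters are still ASCII
print("\x03".isprintable())   # False
print("\xbf".isprintable())   # True: inverted question mark is printable
print("\xbf".isascii())       # False: but it is outside ASCII
print(is_printable_ascii("hello world"))  # True: space counts as printable
print(is_printable_ascii("tab\there"))    # False: tab does not
```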
190

In Python 3, we can encode the string as UTF-8, then check whether the length stays the same. If so, then the original string is ASCII.

def isascii(s):
    """Check if the characters in string s are in ASCII, U+0-U+7F."""
    return len(s) == len(s.encode())

To check, pass the test string:

>>> isascii("♥O◘♦♥O◘♦")
False
>>> isascii("Python")
True

7 Comments

This is a nice little trick to detect non-ascii characters in Unicode strings, which in python3 is pretty much all the strings. Each ascii character encodes to exactly 1 byte in UTF-8, so an all-ascii string keeps the same length after encoding to bytes, whereas each non-ascii character encodes to 2, 3, or 4 bytes, which increases the length.
By far the best answer, but note that some chars like … and — may look like ascii, so in case you want to use this to detect English text, make sure you replace such chars before checking.
But in Python2 it'll throw an UnicodeEncodeError. Got to find a solution for both Py2 and Py3
This is just plain wasteful. It encodes a string in UTF-8, creating a whole other bytestring. True Python 3 way is try: s.encode('ascii'); return True except UnicodeEncodeError: return False (Like above, but encoding, as strings are Unicode in Python 3). This answer also raises an error in Python 3 when you have surrogates (e.g. isascii('\uD800') raises an error instead of returning False)
This looks quite beautiful, but I wonder if it's as efficient as all when handling a long string
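The surrogate case raised in the comments above can be reproduced with a small Python 3 sketch. Both function names are hypothetical: isascii_len restates the length comparison from the answer, and isascii_encode is the try/except alternative proposed in the comment.

```python
def isascii_len(s):
    # Length comparison from the answer; encodes to UTF-8 by default.
    return len(s) == len(s.encode())


def isascii_encode(s):
    # Alternative from the comment: let the ASCII codec decide.
    try:
        s.encode('ascii')
    except UnicodeEncodeError:
        return False
    return True


lone_surrogate = '\ud800'

try:
    isascii_len(lone_surrogate)
    raised = False
except UnicodeEncodeError:
    raised = True  # UTF-8 refuses to encode a lone surrogate

print(raised)                          # True
print(isascii_encode(lone_surrogate))  # False
print(isascii_len('Python'), isascii_encode('Python'))  # True True
```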
29

Vincent Marchetti has the right idea, but str.decode does not exist in Python 3 (str objects have no decode method). In Python 3 you can make the same test with str.encode:

try:
    mystring.encode('ascii')
except UnicodeEncodeError:
    pass  # string is not ascii
else:
    pass  # string is ascii

Note the exception you want to catch has also changed from UnicodeDecodeError to UnicodeEncodeError.

10 Comments

OP's input is a bytestring (bytes type in Python 3 that has no .encode() method). .decode() in @Vincent Marchetti's answer is correct.
@J.F.Sebastian The OP asks "How to check if a string in Python is in ASCII?" and does not specify bytes vs unicode strings. Why do you say his/her input is a bytestring?
look at the date of the question: 'é' was a bytestring at the time.
@J.F.Sebastian, ok, well considering this answer answers this question as if it were asked today, I think it's still valid and helpful. Fewer and fewer people will come here looking for answers as if they were running Python in 2008
I found this question when I was searching for a solution for Python 3, and quickly reading the question didn't make me suspect that it was Python 2 specific. But this answer was really helpful - upvoting!
19

Ran into something like this recently - for future reference

import chardet

encoding = chardet.detect(string)
if encoding['encoding'] == 'ascii':
    print 'string is in ascii'

which you could use with:

string_ascii = string.decode(encoding['encoding']).encode('ascii')

3 Comments

Of course, this requires the chardet library.
yes, though chardet is available by default in most installations
chardet only guesses the encoding with a certain probability like this: {'confidence': 0.99, 'encoding': 'EUC-JP'} (which in this case was completely wrong)
18

Your question is incorrect; the error you see is not a result of how you built python, but of a confusion between byte strings and unicode strings.

Byte strings (e.g. "foo", or 'bar', in python syntax) are sequences of octets; numbers from 0-255. Unicode strings (e.g. u"foo" or u'bar') are sequences of unicode code points; numbers from 0-1114111 (0x10FFFF). But you appear to be interested in the character é, which (in your terminal) is a multi-byte sequence that represents a single character.

Instead of ord(u'é'), try this:

>>> [ord(x) for x in u'é']

That tells you which sequence of code points "é" represents. It may give you [233], or it may give you [101, 769].

Instead of chr() to reverse this, there is unichr():

>>> unichr(233)
u'\xe9'

This character may actually be represented as either a single or multiple unicode "code points", which themselves represent either graphemes or characters. It's either "e with an acute accent (i.e., code point 233)", or "e" (code point 101), followed by "an acute accent on the previous character" (code point 769). So this exact same character may be presented as the Python data structure u'e\u0301' or u'\u00e9'.

Most of the time you shouldn't have to care about this, but it can become an issue if you are iterating over a unicode string, as iteration works by code point, not by decomposable character. In other words, len(u'e\u0301') == 2 and len(u'\u00e9') == 1. If this matters to you, you can convert between composed and decomposed forms by using unicodedata.normalize.
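As a quick illustration of the composed and decomposed forms, here is a sketch in Python 3 syntax, where plain string literals are already Unicode and no u prefix is needed. The variable names are only for illustration.

```python
import unicodedata

composed = '\u00e9'     # single code point: e with acute accent
decomposed = 'e\u0301'  # 'e' followed by the combining acute accent

print([ord(c) for c in composed])    # [233]
print([ord(c) for c in decomposed])  # [101, 769]
print(len(composed), len(decomposed))  # 1 2

# NFC composes, NFD decomposes.
print(unicodedata.normalize('NFC', decomposed) == composed)  # True
print(unicodedata.normalize('NFD', composed) == decomposed)  # True
```

Note that neither form is ASCII as a whole, although the decomposed form begins with the ASCII letter e.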

The Unicode Glossary can be a helpful guide to understanding some of these issues, by pointing out how each specific term refers to a different part of the representation of text, which is far more complicated than many programmers realize.

6 Comments

'é' does not necessarily represent a single code point. It could be two code points (U+0065 + U+0301).
Each abstract character is always represented by a single code point. However, code points may be encoded to multiple bytes, depending on the encoding scheme. i.e., 'é' is two bytes in UTF-8 and UTF-16, and four bytes in UTF-32, but it is in each case still a single code point — U+00E9.
@Ben Blank: U+0065 and U+0301 are code points and they do represent 'é' which can also be represented by U+00E9. Google "combining acute accent".
J.F. is right about combining U+0065 and U+0301 to form 'é', but this is not a reversible function. You will get U+00E9. According to Wikipedia, these composite code points are useful for backwards compatibility.
@teehoo - It is a reversible function in the sense that you may re-normalize the code point representing the composed character into a sequence of code points representing the same composed character. In Python you can do this like so: unicodedata.normalize('NFD', u'\xe9').
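The byte counts mentioned in the comments above can be checked with a short Python 3 sketch. The -le codec variants are used here on purpose, since the plain 'utf-16' and 'utf-32' codecs prepend a byte order mark that would inflate the lengths.

```python
e_acute = '\u00e9'  # one code point, U+00E9

sizes = {}
for codec in ('utf-8', 'utf-16-le', 'utf-32-le'):
    sizes[codec] = len(e_acute.encode(codec))
    print(codec, sizes[codec])
# utf-8 2
# utf-16-le 2
# utf-32-le 4

print(len(e_acute))  # 1: still a single code point in every case
```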
9

How about doing this?

import string

def isAscii(s):
    for c in s:
        if c not in string.ascii_letters:
            return False
    return True

1 Comment

This fails if your string contains ASCII characters which are not letters. For your code examples, that includes newline, space, dot, comma, underscore, and parentheses.
9

I found this question while trying to determine how to use/encode/decode a string whose encoding I wasn't sure of (and how to escape/convert special characters in that string).

My first step should have been to check the type of the string. I didn't realize I could get good data about its formatting from type(s). This answer was very helpful and got to the real root of my issues.

If you're getting a rude and persistent

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 263: ordinal not in range(128)

particularly when you're ENCODING, make sure you're not trying to unicode() a string that already IS unicode- for some terrible reason, you get ascii codec errors. (See also the Python Kitchen recipe, and the Python docs tutorials for better understanding of how terrible this can be.)

Eventually I determined that what I wanted to do was this:

escaped_string = unicode(original_string.encode('ascii','xmlcharrefreplace'))

Also helpful in debugging was setting the default coding in my file to utf-8 (put this at the beginning of your python file):

# -*- coding: utf-8 -*-

That allows you to test special characters ('àéç') without having to use their unicode escapes (u'\xe0\xe9\xe7').

>>> specials='àéç'
>>> specials.decode('latin-1').encode('ascii','xmlcharrefreplace')
'&#224;&#233;&#231;'
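For readers on Python 3, a rough equivalent of the session above might look like the sketch below. It assumes the text is already a str, so no latin-1 decode step is needed, and the variable names are only illustrative.

```python
specials = '\xe0\xe9\xe7'  # a-grave, e-acute, c-cedilla

# Replace every non-ASCII character with an XML numeric character reference,
# then turn the resulting bytes back into a str.
escaped = specials.encode('ascii', 'xmlcharrefreplace').decode('ascii')
print(escaped)  # &#224;&#233;&#231;
```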


4

To improve Alexander's solution: from Python 2.6 onward (and in Python 3.x) you can use the helper module curses.ascii and its curses.ascii.isascii() function or various others: https://docs.python.org/2.6/library/curses.ascii.html

from curses import ascii

def isascii(s):
    return all(ascii.isascii(c) for c in s)

1 Comment

2

You could use the regular expression library which accepts the Posix standard [[:ASCII:]] definition.

1 Comment

The re module in the Python standard library does not support POSIX character classes.
2

A string (str type) in Python 2 is a series of bytes. There is no way of telling just from looking at the string whether this series of bytes represents an ascii string, a string in an 8-bit charset like ISO-8859-1, or a string encoded with UTF-8 or UTF-16 or whatever.

However if you know the encoding used, then you can decode the str into a unicode string and then use a regular expression (or a loop) to check if it contains characters outside of the range you are concerned about.
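A minimal sketch of that idea in Python 3 syntax, where the byte string is a bytes object. The sample data and the choice of ISO-8859-1 are assumptions made purely for illustration.

```python
raw = b'caf\xe9'                 # assumed to be ISO-8859-1 encoded
text = raw.decode('iso-8859-1')  # only valid because the encoding is known

# Collect the characters that fall outside the ASCII range.
non_ascii = [c for c in text if ord(c) > 127]
print(len(text))       # 4
print(len(non_ascii))  # 1
print(ord(non_ascii[0]))  # 233
```

Decoding the same bytes with the wrong encoding would either fail or silently give different characters, which is why the encoding has to be known up front.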


2

Like @RogerDahl's answer but it's more efficient to short-circuit by negating the character class and using search instead of findall or match.

>>> import re
>>> re.search('[^\x00-\x7F]', 'Did you catch that \x00?') is not None
False
>>> re.search('[^\x00-\x7F]', 'Did you catch that \xFF?') is not None
True

I imagine a regular expression is well-optimized for this.


0
import re

def is_ascii(s):
    return bool(re.match(r'[\x00-\x7F]+$', s))

To include an empty string as ASCII, change the + to *.
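On Python 3.4 or later, the same idea could also be written with re.fullmatch, which removes the need for the trailing $. This is only a sketch: the names ASCII_ONLY and is_ascii_re are hypothetical, and the * quantifier is a deliberate choice so that the empty string counts as ASCII.

```python
import re

# Precompiled pattern: zero or more characters in the range U+0000..U+007F.
ASCII_ONLY = re.compile(r'[\x00-\x7F]*')


def is_ascii_re(s):
    # fullmatch only succeeds if the whole string matches the pattern.
    return ASCII_ONLY.fullmatch(s) is not None


print(is_ascii_re(''))           # True
print(is_ascii_re('abc\n'))      # True
print(is_ascii_re('caf\u00e9'))  # False
```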


-2

To prevent your code from crashing, you may want to use a try-except to catch TypeErrors

>>> ord("¶")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found

For example

def is_ascii(s):
    try:
        return all(ord(c) < 128 for c in s)
    except TypeError:
        return False

1 Comment

This try wrapper is completely pointless. If "¶" is a Unicode string, then ord("¶") will work, and if it’s not (Python 2), for c in s will decompose it into bytes so ord will continue to work.
