How to check if a string in Python is in ASCII?

Question

I want to check whether a string is in ASCII or not.

I am aware of ord(); however, when I try ord('é'), I get TypeError: ord() expected a character, but string of length 2 found. I understand this is caused by the way I built Python (as explained in ord()'s documentation).

Is there another way to check?

String encoding differs quite a bit between Python 2 and Python 3, so it would be good to know which version you're targeting.
– florisla, Commented Jul 13, 2017 at 9:46
@florisla Based on the error from ord('é'), OP's using Python 2.
– wjandrea, Commented Jul 12, 2020 at 18:16

Vincent Marchetti · Accepted Answer · 2008-10-13 00:30:32Z

290

I think you are not asking the right question--

A string in python has no property corresponding to 'ascii', utf-8, or any other encoding. The source of your string (whether you read it from a file, input from a keyboard, etc.) may have encoded a unicode string in ascii to produce your string, but that's where you need to go for an answer.

Perhaps the question you can ask is: "Is this string the result of encoding a unicode string in ascii?" -- This you can answer by trying:

try:
    mystring.decode('ascii')
except UnicodeDecodeError:
    print "it was not an ascii-encoded unicode string"
else:
    print "It may have been an ascii-encoded unicode string"

answered Oct 13, 2008 at 0:30

Vincent Marchetti


7 Comments

Jet Guo Over a year ago

Using encode is better, because str has no decode method in Python 3; see "What's the difference between encode/decode? (Python 2.x)".

dotancohen Over a year ago

@Sri: That is because you are using it on an unencoded string (str in Python 2, bytes in Python 3).

alexis Over a year ago

In Python 2, this solution only works for a unicode string. A str in any ISO encoding would need to be encoded to Unicode first. The answer should go into this.

jfs Over a year ago

@JetGuo: you should use both depending on the input type: s.decode('ascii') if isinstance(s, bytes) else s.encode('ascii') in Python 3. OP's input is a bytestring 'é' (Python 2 syntax, Python 3 hadn't been released at the time) and therefore .decode() is correct.

jfs Over a year ago

@alexis: wrong. str on Python 2 is a bytestring. It is correct to use .decode('ascii') to find out whether all bytes are in the ascii range.

Alexander Kojevnikov · Accepted Answer · 2008-10-13 00:30:43Z

236

def is_ascii(s):
    return all(ord(c) < 128 for c in s)
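
In Python 3, iterating over a bytes object yields integers rather than one-character strings, so ord(c) would raise a TypeError there. A minimal sketch that accepts both types, assuming Python 3 (the name is_ascii_any is made up for illustration):

```python
def is_ascii_any(s):
    """Return True if every element of s (str or bytes) is below 128."""
    if isinstance(s, (bytes, bytearray)):
        # Iterating over bytes yields ints in Python 3, so ord() is not needed.
        return all(b < 128 for b in s)
    # Iterating over str yields one-character strings.
    return all(ord(c) < 128 for c in s)


print(is_ascii_any("Python"))         # True
print(is_ascii_any("é"))              # False
print(is_ascii_any(b"caf\xc3\xa9"))   # False
```

all() still short-circuits on the first element that is 128 or above, as in the one-liner above.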

answered Oct 13, 2008 at 0:30

Alexander Kojevnikov

14 Comments

ddaa Over a year ago

Pointlessly inefficient. Much better to try s.decode('ascii') and catch UnicodeDecodeError, as suggested by Vincent Marchetti.

John Millikin Over a year ago

It's not inefficient. all() will short-circuit and return False as soon as it encounters an invalid byte.

Jeremy Cantrell Over a year ago

Inefficient or not, the more pythonic method is the try/except.

ddaa Over a year ago

It is inefficient compared to the try/except. Here the loop is in the interpreter. With the try/except form, the loop is in the C codec implementation called by str.decode('ascii'). And I agree, the try/except form is more pythonic too.

Slater Victoroff Over a year ago

@JohnMachin ord(c) < 128 is infinitely more readable and intuitive than c <= "\x7F"

Taku · Accepted Answer · 2018-07-02 18:32:22Z

192

New in Python 3.7 (bpo32677)

No more tiresome/inefficient ASCII checks on strings: the new built-in str/bytes/bytearray method .isascii() checks whether the string is ASCII.

print("is this ascii?".isascii())
# True
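
Since .isascii() also accepts control characters such as "\x03", a stricter check may be wanted when the text is meant for display. A small sketch, assuming Python 3.7+ and a hypothetical helper name is_printable_ascii, that combines it with str.isprintable():

```python
def is_printable_ascii(text):
    """Return True if text is ASCII and contains no control characters."""
    # isascii() alone accepts code points 0-127, control characters included;
    # isprintable() alone accepts printable non-ASCII characters such as "¿".
    return text.isascii() and text.isprintable()


print(is_printable_ascii("hello world"))  # True
print(is_printable_ascii("\x03"))         # False
print(is_printable_ascii("¿"))            # False
print(is_printable_ascii("a\tb"))         # False
```

Spaces count as printable here, while tabs and newlines do not.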

answered Jul 2, 2018 at 18:32

Taku

6 Comments

Luc Over a year ago

"\x03".isascii() is also True. The documentation says this just checks that all characters are below code point 128 (0-127). If you also want to avoid control characters, you will need: text.isascii() and text.isprintable(). Just using isprintable by itself is also not enough, as it will consider a character like ¿ to be (correctly) printable, but it's not within the ascii printable section, so you need to check both if you want both. Yet another gotcha: spaces are considered printable, tabs and newlines are not.

wjandrea Over a year ago

@Luc Good to know, but ASCII includes control chars. Avoiding them is another topic.

Luc Over a year ago

@wjandrea Yeah, obviously, but because 0x03 fits in 7 bits doesn't mean that it's what most people will want to be checking for when they find this page in their search results.

wjandrea Over a year ago

@Luc Yes, exactly. If someone thinks that all ASCII characters are safe to print, they're mistaken, but that's a valid topic and could deserve its own question.

John Y Over a year ago

It's unfortunate that there isn't some way to make this answer jump to the top other than wait for upvotes. If the OP would log on again, they could at least accept it, but it seems they haven't been seen at all since posting this question.

wjandrea · Accepted Answer · 2020-08-17 15:58:49Z

190

In Python 3, we can encode the string as UTF-8, then check whether the length stays the same. If so, then the original string is ASCII.

def isascii(s):
    """Check if the characters in string s are in ASCII, U+0-U+7F."""
    return len(s) == len(s.encode())

To check, pass the test string:

>>> isascii("♥O◘♦♥O◘♦")
False
>>> isascii("Python")
True

edited Aug 17, 2020 at 15:58

wjandrea

answered Aug 23, 2013 at 13:14

far

7 Comments

Devy Over a year ago

This is a nice little trick to detect non-ASCII characters in Unicode strings, which in Python 3 is pretty much all strings. ASCII characters are encoded in UTF-8 using exactly 1 byte each, so an all-ASCII string keeps the same length after encoding to bytes, whereas each non-ASCII character is encoded to 2 or more bytes, which increases the length.

Christophe Roussy Over a year ago

By @far the best answer, but note that some chars like … and — may look like ASCII, so if you want to use this to detect English text, make sure you replace such chars before checking.

alvas Over a year ago

But in Python 2 it'll throw a UnicodeEncodeError. Got to find a solution for both Py2 and Py3.

Artyer Over a year ago

This is just plain wasteful. It encodes a string in UTF-8, creating a whole other bytestring. True Python 3 way is try: s.encode('ascii'); return True except UnicodeEncodeError: return False (Like above, but encoding, as strings are Unicode in Python 3). This answer also raises an error in Python 3 when you have surrogates (e.g. isascii('\uD800') raises an error instead of returning False)
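
The surrogate case can be reproduced with a short sketch, assuming Python 3. The names isascii_by_length and isascii_by_encode are made up for illustration; the first mirrors the length comparison from the answer and the second encodes to ASCII instead:

```python
def isascii_by_length(s):
    """Length comparison: encodes to UTF-8 and compares lengths."""
    return len(s) == len(s.encode())


def isascii_by_encode(s):
    """Encode to ASCII and report failure as False."""
    try:
        s.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


print(isascii_by_encode("Python"))   # True
print(isascii_by_encode("\ud800"))   # False

try:
    isascii_by_length("\ud800")
except UnicodeEncodeError:
    # UTF-8 refuses to encode a lone surrogate, so the error escapes.
    print("length-based check raised UnicodeEncodeError")
```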

Endle_Zhenbo Over a year ago

This looks quite beautiful, but I wonder if it's as efficient as all when handling a long string

drs · Accepted Answer · 2015-09-02 15:45:04Z

29

Vincent Marchetti has the right idea, but str.decode does not exist in Python 3 (str is already Unicode; only bytes has a decode method). In Python 3 you can make the same test with str.encode:

try:
    mystring.encode('ascii')
except UnicodeEncodeError:
    pass  # string is not ascii
else:
    pass  # string is ascii

Note the exception you want to catch has also changed from UnicodeDecodeError to UnicodeEncodeError.
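
If the input may be either str or bytes, the two variants can be folded into one helper. This is only a sketch, assuming Python 3; the function name is_ascii is chosen for illustration. It relies on UnicodeEncodeError and UnicodeDecodeError both being subclasses of UnicodeError:

```python
def is_ascii(s):
    """Return True if s (str or bytes) contains only ASCII."""
    try:
        if isinstance(s, bytes):
            s.decode("ascii")   # bytes -> str, fails on any byte >= 0x80
        else:
            s.encode("ascii")   # str -> bytes, fails on any code point >= 128
    except UnicodeError:
        # Covers both UnicodeDecodeError and UnicodeEncodeError.
        return False
    return True


print(is_ascii("plain text"))     # True
print(is_ascii("é"))              # False
print(is_ascii(b"\xc3\xa9"))      # False
```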

answered Sep 2, 2015 at 15:45

drs

10 Comments

jfs Over a year ago

OP's input is a bytestring (bytes type in Python 3 that has no .encode() method). .decode() in @Vincent Marchetti's answer is correct.

drs Over a year ago

@J.F.Sebastian The OP asks "How to check if a string in Python is in ASCII?" and does not specify bytes vs unicode strings. Why do you say his/her input is a bytestring?

jfs Over a year ago

look at the date of the question: 'é' was a bytestring at the time.

drs Over a year ago

@J.F.Sebastian, ok, well considering this answer answers this question as if it were asked today, I think it's still valid and helpful. Fewer and fewer people will come here looking for answers as if they were running Python in 2008

josch Over a year ago

I found this question when I was searching for a solution for Python 3, and quickly reading the question didn't make me suspect that this was Python 2 specific. But this answer was really helpful - upvoting!

Alvin · Accepted Answer · 2011-08-08 20:47:22Z

19

Ran into something like this recently - for future reference

import chardet

encoding = chardet.detect(string)
if encoding['encoding'] == 'ascii':
    print 'string is in ascii'

which you could use with:

string_ascii = string.decode(encoding['encoding']).encode('ascii')

answered Aug 8, 2011 at 20:47

Alvin

3 Comments

StackExchange saddens dancek Over a year ago

Of course, this requires the chardet library.

Alvin Over a year ago

yes, though chardet is available by default in most installations

Suzana Over a year ago

chardet only guesses the encoding with a certain probability like this: {'confidence': 0.99, 'encoding': 'EUC-JP'} (which in this case was completely wrong)

Glyph · Accepted Answer · 2011-07-08 02:10:02Z

18

Your question is incorrect; the error you see is not a result of how you built python, but of a confusion between byte strings and unicode strings.

Byte strings (e.g. "foo", or 'bar', in python syntax) are sequences of octets; numbers from 0-255. Unicode strings (e.g. u"foo" or u'bar') are sequences of unicode code points; numbers from 0-1114111 (0x10FFFF). But you appear to be interested in the character é, which (in your terminal) is a multi-byte sequence that represents a single character.

Instead of ord(u'é'), try this:

>>> [ord(x) for x in u'é']

That tells you which sequence of code points "é" represents. It may give you [233], or it may give you [101, 769].

Instead of chr() to reverse this, there is unichr():

>>> unichr(233)
u'\xe9'

This character may actually be represented by either a single or multiple unicode "code points", which themselves represent either graphemes or characters. It's either "e with an acute accent (i.e., code point 233)", or "e" (code point 101), followed by "an acute accent on the previous character" (code point 769). So this exact same character may be presented as the Python data structure u'e\u0301' or u'\u00e9'.

Most of the time you shouldn't have to care about this, but it can become an issue if you are iterating over a unicode string, as iteration works by code point, not by decomposable character. In other words, len(u'e\u0301') == 2 and len(u'\u00e9') == 1. If this matters to you, you can convert between composed and decomposed forms by using unicodedata.normalize.
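
As a rough illustration of the composed and decomposed forms, here is a sketch in Python 3 syntax (where every str is Unicode, so the u prefix is dropped); the variable names are arbitrary:

```python
import unicodedata

composed = "\u00e9"      # é as a single code point (233)
decomposed = "e\u0301"   # "e" (101) followed by a combining acute accent (769)

print([ord(c) for c in composed])     # [233]
print([ord(c) for c in decomposed])   # [101, 769]

# NFC composes where possible; NFD decomposes.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
```

Neither form is ASCII: the composed form contains code point 233 and the decomposed form contains 769.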

The Unicode Glossary can be a helpful guide to understanding some of these issues, by pointing out how each specific term refers to a different part of the representation of text, which is far more complicated than many programmers realize.

edited Jul 8, 2011 at 2:10

answered Oct 14, 2008 at 7:36

Glyph

6 Comments

jfs Over a year ago

'é' does not necessarily represent a single code point. It could be two code points (U+0065 + U+0301).

Ben Blank Over a year ago

Each abstract character is always represented by a single code point. However, code points may be encoded to multiple bytes, depending on the encoding scheme. i.e., 'é' is two bytes in UTF-8 and UTF-16, and four bytes in UTF-32, but it is in each case still a single code point — U+00E9.

jfs Over a year ago

@Ben Blank: U+0065 and U+0301 are code points and they do represent 'é' which can also be represented by U+00E9. Google "combining acute accent".

Martin Konecny Over a year ago

J.F. is right about combining U+0065 and U+0301 to form 'é', but this is not a reversible function. You will get U+00E9. According to Wikipedia, these composite code points are useful for backwards compatibility.

Glyph Over a year ago

@teehoo - It is a reversible function in the sense that you may re-normalize the code point representing the composed character into a sequence of code points representing the same composed character. In Python you can do this like so: unicodedata.normalize('NFD', u'\xe9').

miya · Accepted Answer · 2008-10-13 16:38:25Z

9

How about doing this?

import string

def isAscii(s):
    for c in s:
        if c not in string.ascii_letters:
            return False
    return True

answered Oct 13, 2008 at 16:38

miya

1 Comment

florisla Over a year ago

This fails if your string contains ASCII characters which are not letters. For code examples, that includes newline, space, dot, comma, underscore, and parentheses.

Community · Accepted Answer · 2017-05-23 11:55:10Z

I found this question while trying to determine how to use/encode/decode a string whose encoding I wasn't sure of (and how to escape/convert special characters in that string).

My first step should have been to check the type of the string. I didn't realize I could get good data about its formatting from type(s). This answer was very helpful and got to the real root of my issues.

If you're getting a rude and persistent

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 263: ordinal not in range(128)

particularly when you're ENCODING, make sure you're not trying to unicode() a string that already IS unicode- for some terrible reason, you get ascii codec errors. (See also the Python Kitchen recipe, and the Python docs tutorials for better understanding of how terrible this can be.)

Eventually I determined that what I wanted to do was this:

escaped_string = unicode(original_string.encode('ascii','xmlcharrefreplace'))

Also helpful in debugging was setting the default coding in my file to utf-8 (put this at the beginning of your python file):

# -*- coding: utf-8 -*-

That allows you to test special characters ('àéç') without having to use their unicode escapes (u'\xe0\xe9\xe7').

>>> specials='àéç'
>>> specials.decode('latin-1').encode('ascii','xmlcharrefreplace')
'&#224;&#233;&#231;'
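
For comparison, a Python 3 sketch of the same escaping step. It assumes the input is already a str, so no decode('latin-1') step is needed; the variable names are illustrative:

```python
specials = "àéç"

# encode() returns bytes; decode back to str to get ASCII-only text.
escaped = specials.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(escaped)  # &#224;&#233;&#231;
```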

Sergey Nevmerzhitsky · Accepted Answer · 2015-05-22 09:17:16Z

4

To improve on Alexander's solution, from Python 2.6 onward (and in Python 3.x) you can use the helper module curses.ascii, with its curses.ascii.isascii() function or one of its other functions: https://docs.python.org/2.6/library/curses.ascii.html

from curses import ascii

def isascii(s):
    return all(ascii.isascii(c) for c in s)

edited May 22, 2015 at 9:17

answered May 22, 2015 at 8:48

Sergey Nevmerzhitsky

1 Comment

jfs Over a year ago

it works but beware there are known issues with character classification functions from curses.ascii

Steve Moyer · Accepted Answer · 2008-10-13 00:18:25Z

2

You could use the regular expression library which accepts the Posix standard [[:ASCII:]] definition.

answered Oct 13, 2008 at 0:18

Steve Moyer

1 Comment

Flux Over a year ago

The re module in the Python standard library does not support POSIX character classes.

JacquesB · Accepted Answer · 2008-10-14 08:10:06Z

2

A string (str-type) in Python is a series of bytes. There is no way of telling just from looking at the string whether this series of bytes represents an ascii string, a string in an 8-bit charset like ISO-8859-1, or a string encoded with UTF-8 or UTF-16 or whatever.

However if you know the encoding used, then you can decode the str into a unicode string and then use a regular expression (or a loop) to check if it contains characters outside of the range you are concerned about.
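
A minimal sketch of that decode-then-check approach, assuming Python 3 (where the byte sequence is a bytes object) and a caller who knows the encoding. The helper name has_non_ascii and the sample inputs are made up for illustration:

```python
import re

# Matches any character outside the ASCII range U+0000-U+007F.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def has_non_ascii(raw, encoding):
    """Decode raw bytes with a known encoding, then look for non-ASCII text."""
    text = raw.decode(encoding)
    return NON_ASCII.search(text) is not None


print(has_non_ascii(b"caf\xe9", "latin-1"))              # True
print(has_non_ascii("cafe".encode("utf-16"), "utf-16"))  # False
```

The UTF-16 case shows why the raw bytes alone are not enough: the encoded form of "cafe" contains a byte order mark and NUL bytes, yet the decoded text is plain ASCII.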

edited Oct 14, 2008 at 8:10

answered Oct 14, 2008 at 7:58

JacquesB

Community · Accepted Answer · 2017-05-23 12:18:24Z

2

Like @RogerDahl's answer, but it's more efficient to short-circuit by negating the character class and using search instead of findall or match.

>>> import re
>>> re.search('[^\x00-\x7F]', 'Did you catch that \x00?') is not None
False
>>> re.search('[^\x00-\x7F]', 'Did you catch that \xFF?') is not None
True

I imagine a regular expression is well-optimized for this.

edited May 23, 2017 at 12:18

CommunityBot

answered Oct 28, 2016 at 16:30

hobs

Roger Dahl · Accepted Answer · 2015-09-30 14:51:52Z

0

import re

def is_ascii(s):
    return bool(re.match(r'[\x00-\x7F]+$', s))

To include an empty string as ASCII, change the + to *.

answered Sep 30, 2015 at 14:51

Roger Dahl

user2489252 · Accepted Answer · 2013-07-07 21:16:00Z

-2

To prevent your code from crashing, you may want to use a try-except to catch TypeErrors:

>>> ord("¶")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found

For example

def is_ascii(s):
    try:
        return all(ord(c) < 128 for c in s)
    except TypeError:
        return False

answered Jul 7, 2013 at 21:16

user2489252

1 Comment

Ry- Over a year ago

This try wrapper is completely pointless. If "¶" is a Unicode string, then ord("¶") will work, and if it’s not (Python 2), for c in s will decompose it into bytes so ord will continue to work.
