Parse Python Identifier

Question

I need to determine whether a string is a valid Python identifier. Since Python 3 identifiers can contain obscure Unicode characters, and Python's syntax might change across releases, I decided to avoid manual parsing. Unfortunately, my attempts at using Python's internal interfaces don't seem to work:

I. function compile

>>> string = "a = 5; b "
>>> test = "{} = 5"
>>> compile(test.format(string), "<string>", "exec")
<code object <module> at 0xb71b4d90, file "<string>", line 1>

Clearly the template in test can't force compile to treat string as a single ast.Name: the input simply smuggles in extra statements.

Next I attempt using the modules ast and parser. These modules are intended to derive a string, rather than determining if a string matches a particular derivation, but I figure they might be helpful anyway.

II. module ast

>>> import ast
>>> a=ast.Module(body=[ast.Expr(value=ast.Name(id='1a', ctx=ast.Load()))])
>>> af = ast.fix_missing_locations(a)
>>> c = compile(af, "<string>", "exec")
>>> exec(c)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1, in <module>
NameError: name '1a' is not defined

OK, clearly Name isn't parsing '1a' for correctness. Perhaps this step happens earlier, in the parse phase.

III. module parser

>>> import parser  # note: this module was removed in Python 3.10
>>> p = parser.suite("a")
>>> t = parser.st2tuple(p)
>>> t
(257, (268, (269, (270, (271, (272, (302, (306, (307, (308, (309, (312, (313, (314, (315, (316, (317, (318, (319, (320, (1, 'a')))))))))))))))))), (4, ''))), (4, ''), (0, ''))
>>> 
>>> t = (257, (268, (269, (270, (271, (272, (302, (306, (307, (308, (309, (312, (313, (314, (315, (316, (317, (318, (319, (320, (1, '1a')))))))))))))))))), (4, ''))), (4, ''), (0, ''))
>>> p = parser.sequence2st(t)
>>> c = parser.compilest(p)
>>> exec(c)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<syntax-tree>", line 0, in <module>
NameError: name '1a' is not defined

OK, still not being checked... why? A quick check of Python's full grammar specification shows that NAME is not defined there. If these checks are performed by the bytecode compiler, shouldn't 1a have been caught?

I'm starting to suspect Python exposes no functionality for this. I'm also curious why some of these attempts failed.

rici · Accepted Answer · 2014-08-14 06:43:27Z

You don't need to parse, just tokenize and, if you care, test whether the returned NAME is a keyword.

Example, partly adapted from the linked documentation:

>>> import tokenize
>>> from io import BytesIO
>>> from keyword import iskeyword
>>> s = "def twoπ(a,b):"
>>> g = tokenize.tokenize(BytesIO(s.encode("utf-8")).readline)
>>> for toktype, tokval, st, end, _ in g:
...   if toktype == tokenize.NAME and iskeyword(tokval):
...     print ("KEYWORD ", tokval)
...   else:
...     print(toktype, tokval)
... 
56 utf-8
KEYWORD  def
1 twoπ
52 (
1 a
52 ,
1 b
52 )
52 :
0

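You'll always get an ENCODING token (56 here) at the beginning of the input and an ENDMARKER token (0) at the end. The numeric token types printed above depend on the Python version, so compare against the named constants such as tokenize.ENCODING rather than the numbers.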
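To turn the tokenizing idea into a yes/no check, something like the following might work. It is only a sketch: is_identifier and its allow_keywords parameter are names made up for illustration, and it assumes that "valid identifier" means the whole string tokenizes to exactly one NAME token whose text equals the input. On Python versions where tokenize is a pure-Python reimplementation, it may not agree with the compiler on every Unicode edge case.

```python
import io
import tokenize
from keyword import iskeyword

# Tokens the tokenizer emits around any input; they carry no content.
_STRUCTURAL = {tokenize.ENCODING, tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER}


def is_identifier(candidate, allow_keywords=False):
    """Hypothetical helper: True if `candidate` tokenizes to exactly one NAME."""
    try:
        readline = io.BytesIO(candidate.encode("utf-8")).readline
        tokens = list(tokenize.tokenize(readline))
    except (tokenize.TokenError, SyntaxError, UnicodeError):
        # Depending on the Python version, malformed input raises one of these.
        return False

    significant = [t for t in tokens if t.type not in _STRUCTURAL]
    if len(significant) != 1:
        # Several tokens (e.g. "a = 5; b ") or none at all (empty string).
        return False

    tok = significant[0]
    # Comparing the token text with the input rejects trailing whitespace.
    if tok.type != tokenize.NAME or tok.string != candidate:
        return False

    # Keywords such as "def" tokenize as NAME, so test them separately.
    return allow_keywords or not iskeyword(tok.string)
```

Counting the significant tokens is what defeats injection-style input such as "a = 5; b ", and the iskeyword test is kept optional because whether keywords count depends on what you need.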

edited Aug 14, 2014 at 6:43

answered Aug 14, 2014 at 5:49


3 Comments

Jason S Over a year ago

This is a nice approach but it appears you need to add a check for reserved words separately using keyword.iskeyword as def, for, etc tokenize as NAME. Not sure if there are other edge cases.

rici Over a year ago

Good point. If you care about keywords, you have to test that. I'll edit the answer.

user19087 Over a year ago

I agree, this is a good approach which also explains why neither ast nor parser worked - identifiers are tokenized and validated before parsing. I find this surprising: python's lexer must encode some state/grammar in order to discriminate valid input, something normally relegated to parsers. Anyway while the tokenize module doesn't expose python internals (unlike ast and parser it reimplements the tokenizer in pure python), it is certainly a more stable approach than using the dis module, which carries a big fat warning: "[No guarantees]..[bytecode varies across VMs and releases]"

Jason S · Accepted Answer · 2014-08-14 05:19:26Z

I'm not sure where you were going with your compile example, but if you compile just the potential identifier in eval mode, it exposes what is going on.

>>> from dis import dis
>>> dis(compile("1", "<string>", "eval"))

  1           0 LOAD_CONST               0 (1)
              3 RETURN_VALUE

>>> dis(compile("a", "<string>", "eval"))

  1           0 LOAD_NAME                0 (a)
              3 RETURN_VALUE

>>> dis(compile("1a", "<string>", "eval"))

  File "<string>", line 1
    1a
     ^
SyntaxError: unexpected EOF while parsing

>>> dis(compile("你好", "<string>", "eval"))

  1           0 LOAD_NAME                0 (你好)
              3 RETURN_VALUE

This would need more testing for edge cases before real use, but getting a LOAD_NAME opcode back is a good indication. Failure can show up either as an exception or as a different opcode, so you have to check for both.
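A rough sketch of that two-part check, for illustration only. The function name loads_single_name is made up here, and the comparison against co_names is an added assumption meant to reject inputs such as (a) that also compile to a single LOAD_NAME. It depends on CPython bytecode details, which are not guaranteed to be stable across releases, and it will reject identifiers that the compiler normalizes to a different spelling.

```python
import dis


def loads_single_name(candidate):
    """Hypothetical helper: compile in eval mode and look for one LOAD_NAME."""
    try:
        code = compile(candidate, "<string>", "eval")
    except (SyntaxError, ValueError):
        # Failure state 1: the text does not compile as an expression.
        return False

    opnames = [ins.opname for ins in dis.get_instructions(code)]
    if "LOAD_NAME" not in opnames:
        # Failure state 2: a different opcode, e.g. a constant load for "1".
        return False

    # Assumption added for this sketch: the only name referenced must be
    # spelled exactly like the input, which rules out "(a)", "a.b" or "a ".
    return code.co_names == (candidate,)
```

Keywords are rejected as a side effect, since something like def does not compile as an expression, but ordinary builtins such as print still pass because they are valid identifiers.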

answered Aug 14, 2014 at 5:19


3 Comments

user19087 Over a year ago

The compile example attempts to validate string as an identifier by enforcing a subtree of Python's grammar through assignment. The example also demonstrates how futile this approach is: it suffers from something akin to code injection. Your example does the same by ensuring the bytecode matches a certain pattern. While a good idea, unfortunately the dis interface is too unstable for my taste.

Jason S Over a year ago

dis is just to show you - you would use co_code and opcodes to find whether LOAD_NAME is present.

Jason S Over a year ago

Although, a string like (a) could still be a problem, so never mind.
