I need to determine if a string represents a valid python identifier. Since python 3 identifiers support obscure unicode functionality, and python syntax might change across releases, I decided to avoid manual parsing. Unfortunately my attempts at utilizing python's internal interfaces don't seem to work:

I. function compile

>>> string = "a = 5; b "
>>> test = "{} = 5"
>>> compile(test.format(string), "<string>", "exec")
<code object <module> at 0xb71b4d90, file "<string>", line 1>

Clearly test can't force compile to use ast.Name as the root of the AST.

Next I attempt using the modules ast and parser. These modules are intended to derive a string, rather than determining if a string matches a particular derivation, but I figure they might be helpful anyway.

II. module ast

>>> a=ast.Module(body=[ast.Expr(value=ast.Name(id='1a', ctx=ast.Load()))])
>>> af = ast.fix_missing_locations(a)
>>> c = compile(af, "<string>", "exec")
>>> exec(c)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1, in <module>
NameError: name '1a' is not defined

OK, clearly Name isn't parsing '1a' for correctness. Perhaps this step happens earlier, in the parse phase.

III. module parser

>>> p = parser.suite("a")
>>> t = parser.st2tuple(p)
>>> t
(257, (268, (269, (270, (271, (272, (302, (306, (307, (308, (309, (312, (313, (314, (315, (316, (317, (318, (319, (320, (1, 'a')))))))))))))))))), (4, ''))), (4, ''), (0, ''))
>>> 
>>> t = (257, (268, (269, (270, (271, (272, (302, (306, (307, (308, (309, (312, (313, (314, (315, (316, (317, (318, (319, (320, (1, '1a')))))))))))))))))), (4, ''))), (4, ''), (0, ''))
>>> p = parser.sequence2st(t)
>>> c = parser.compilest(p)
>>> exec(c)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<syntax-tree>", line 0, in <module>
NameError: name '1a' is not defined

OK, still not being checked... why? Quick check of python's full grammar specification shows that NAME is not defined. If these checks are performed by the bytecode compiler, shouldn't 1a have been caught?

I'm starting to suspect python exposes no functionality towards this goal. I'm also curious why some attempts failed.

2 Answers

You don't need to parse, just tokenize and, if you care, test whether the returned NAME is a keyword.

Example, partly adapted from the linked documentation:

>>> import tokenize
>>> from io import BytesIO
>>> from keyword import iskeyword
>>> s = "def twoπ(a,b):"
>>> g = tokenize.tokenize(BytesIO(s.encode("utf-8")).readline)
>>> for toktype, tokval, st, end, _ in g:
...   if toktype == tokenize.NAME and iskeyword(tokval):
...     print ("KEYWORD ", tokval)
...   else:
...     print(toktype, tokval)
... 
56 utf-8
KEYWORD  def
1 twoπ
52 (
1 a
52 ,
1 b
52 )
52 :
0 

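You'll always get an ENCODING token (56 in this session) at the beginning of the input, and an ENDMARKER (0) at the end. The numeric token codes differ between Python releases, so compare against the named constants such as tokenize.ENCODING and tokenize.NAME rather than the raw numbers.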
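The loop above can be folded into a yes/no check. What follows is only a minimal sketch, not a vetted implementation: the function name is_identifier and its allow_keywords flag are made up for illustration, and it assumes that a valid identifier tokenizes to exactly one NAME token whose text equals the whole input.

```python
import tokenize
from io import BytesIO
from keyword import iskeyword

# Token types that carry no content of their own.
_IGNORED = {tokenize.ENCODING, tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER}

def is_identifier(candidate, allow_keywords=False):
    """Hypothetical helper: True if candidate tokenizes to a single NAME."""
    try:
        readline = BytesIO(candidate.encode("utf-8")).readline
        tokens = list(tokenize.tokenize(readline))
    except (tokenize.TokenError, SyntaxError):
        # The tokenizer rejected the input outright.
        return False
    significant = [tok for tok in tokens if tok.type not in _IGNORED]
    if len(significant) != 1:
        # Empty input, several tokens ("a = 5; b"), or INDENT/DEDENT noise.
        return False
    tok = significant[0]
    if tok.type != tokenize.NAME or tok.string != candidate:
        # Not a name, or the input had surrounding whitespace.
        return False
    return allow_keywords or not iskeyword(tok.string)
```

With this sketch, is_identifier("twoπ") is accepted, while "1a", "(a)", "a = 5; b" and "def" are rejected (the last one unless allow_keywords=True is passed).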

3 Comments

This is a nice approach, but it appears you need to check for reserved words separately using keyword.iskeyword, since def, for, etc. tokenize as NAME. Not sure if there are other edge cases.
Good point. If you care about keywords, you have to test that. I'll edit the answer.
I agree, this is a good approach which also explains why neither ast nor parser worked - identifiers are tokenized and validated before parsing. I find this surprising: python's lexer must encode some state/grammar in order to discriminate valid input, something normally relegated to parsers. Anyway while the tokenize module doesn't expose python internals (unlike ast and parser it reimplements the tokenizer in pure python), it is certainly a more stable approach than using the dis module, which carries a big fat warning: "[No guarantees]..[bytecode varies across VMs and releases]"
I'm not sure where you were going with your compile example, but if you compile just the potential identifier for eval, it exposes what is going on.

>>> from dis import dis
>>> dis(compile("1", "<string>", "eval"))

  1           0 LOAD_CONST               0 (1)
              3 RETURN_VALUE

>>> dis(compile("a", "<string>", "eval"))

  1           0 LOAD_NAME                0 (a)
              3 RETURN_VALUE

>>> dis(compile("1a", "<string>", "eval"))

  File "<string>", line 1
    1a
     ^
SyntaxError: unexpected EOF while parsing

>>> dis(compile("你好", "<string>", "eval"))

  1           0 LOAD_NAME                0 (你好)
              3 RETURN_VALUE

It would require more testing before using for real (for edge cases), but getting a LOAD_NAME opcode back is indicative. Failure states can include both an exception and getting a different opcode so you have to check for both.
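A rough sketch of checking for both failure states could look like the following. The function name compiles_to_single_name is made up for illustration, and the extra comparison against co_names is an assumption added here (it is not part of the answer above) to reject inputs such as "(a)" that compile to the same bytecode as "a". Bytecode details vary across Python releases, so treat this as an experiment rather than a reliable validator.

```python
import dis

def compiles_to_single_name(candidate):
    """Hypothetical helper: True if candidate compiles to a load of itself."""
    try:
        code = compile(candidate, "<string>", "eval")
    except (SyntaxError, ValueError):
        # Failure state 1: the compiler rejected the input.
        return False
    # Failure state 2: it compiled, but not to a name lookup (e.g. "1").
    loads_name = any(
        ins.opname == "LOAD_NAME" for ins in dis.get_instructions(code)
    )
    # Assumption: the only name referenced must be the entire input string,
    # which rules out "(a)", "a.b", "a + a" and similar expressions.
    return loads_name and code.co_names == (candidate,)
```

One known gap: the compiler normalizes identifiers, so a valid name whose normalized form differs from its source spelling would be rejected by the co_names comparison.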

3 Comments

The compile example attempts to validate string as an identifier by enforcing a subtree of python's grammar through assignment. The example also demonstrates how futile this approach is: it suffers from something akin to code injection. Your example does the same by ensuring the bytecode matches a certain pattern. While a good idea, unfortunately the dis interface is too unstable for my taste.
dis is just to show you - you would use co_code and opcodes to find whether LOAD_NAME is present.
Although, a string like (a) could still be a problem, so never mind.
