Python Regex reading dictionary variable name from another file

Question

I'm not sure how to word this. I am writing a program that reads through another Python file called code.py, finds all VALID dictionary variable names, and prints them. Easy enough? But the file I'm running it on is deliberately tricky: it contains examples put there on purpose to trick the regex. The test code for code.py is here and my current code is:

    import re

    with open("code.py", "r") as myfile:
        data = myfile.read()

    # Capture a word followed by optional non-word characters and then "{"
    potato = re.findall(r' *(\w+)\W*{', data, re.M)
    for name in potato:
        print(name)

That regex doesn't work 100%. When used on the test code, it prints variables that aren't meant to be printed, such as:

# z={} 
z="z={}"
print('your mother = {}')

The expected output for the test file is a0, a, b, c, d, e, and so on all the way down to z, then aa, ab, ac, ad, and so on all the way down to aq.

Anything really labeled z in the test code shouldn't print. I realise that regex isn't great for doing this, but I have to use regex, and it can be done.

EDIT: Using the new regex (r'^ *(\w+)\W*{',data,re.M), the output fails on examples where several variables are assigned on one line, such as:

d={
   };e={
        };

Other than "l should print but z shouldn't", you don't explain what the expected output is. In other words, what is a "valid dictionary name"?
– Nir Alfasi, Commented Aug 13, 2015 at 6:47
Well, anything that would be assigned a dictionary. If I do d = {}, that makes d a dictionary in Python. But if I did d = '{}', that would make it a string, not a dictionary, and thus not valid.
– Nick Adams, Commented Aug 13, 2015 at 6:54
I can't get an answer that I understand, and people just instantly downvote because of my poor English. I am trying to understand regex a bit better, but I just can't. One of the answers told me to just not use regex, but I have to; the other one was me trying to understand how to start the problem, just asking for advice on what regex to start with.
– Nick Adams, Commented Aug 13, 2015 at 7:10
Why do you 'have' to use a regex? Parsing Python code with a regex is always likely to struggle with some particular code layout or other. You need a proper parser, and as it happens some are already built for you in the Python Standard Library: look under Python Language Services for the parser, ast, or possibly the symtable module.
– DisappointedByUnaccountableMod, Commented Aug 13, 2015 at 12:57
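As a rough illustration of the ast route suggested in the comment above, here is a minimal sketch. It assumes Python 3, treats only names assigned a dict literal as "valid", and uses a made-up sample string in place of code.py. The names SAMPLE and dict_names are hypothetical, and this does not satisfy the "must use regex" requirement.

```python
import ast

# Hypothetical sample standing in for the contents of code.py.
SAMPLE = '''\
a = {}
# z = {}
z = "z={}"
print('your mother = {}')
d = {
    }; e = {
        }
def f():
    aa = {"k": 1}
'''


def dict_names(source):
    """Return the names assigned a dict literal, in source order."""
    found = []
    for node in ast.walk(ast.parse(source)):
        # Only plain assignments whose right-hand side is a {...} literal.
        if isinstance(node, ast.Assign) and isinstance(node.value, ast.Dict):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    found.append((node.lineno, node.col_offset, target.id))
    # ast.walk is breadth-first, so sort by position to get source order.
    return [name for _, _, name in sorted(found)]


names = dict_names(SAMPLE)
print(names)
```

Comments and string contents never reach the syntax tree, so the z decoys are ignored. Dictionaries created with dict() or returned from a function call would need extra handling that this sketch does not attempt.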

vks · Accepted Answer · 2015-08-15 04:49:15Z

2

l should print but z shouldn't

potato = re.findall(r'^ *(\w+)\W*{',data,re.M)

This should fix it.
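To see what the anchor changes, here is a small check. It assumes Python 3 and a made-up sample in place of code.py (SAMPLE is a hypothetical name). With ^ and re.M, the captured name has to sit at the start of a line, after optional spaces.

```python
import re

# Hypothetical sample standing in for the contents of code.py.
SAMPLE = '''\
a = {}
# z={}
z="z={}"
print('your mother = {}')
d={
   };e={
        };
'''

# ^ with re.M matches at the start of every line, not just the start of the text.
names = re.findall(r'^ *(\w+)\W*{', SAMPLE, re.M)
print(names)
```

On this sample the comment, the string, and the print call are skipped, but e is missed too, because it does not start its line.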

EDIT:

".*?(?<!\\)"|'.*?(?<!\\)'|\([^)(]*\)|#[^\n]*\n|[^\'\"\#(\w\n]*(\w+)[^\w]*?{

See demo.

https://regex101.com/r/gP5iH5/6
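One way to use this pattern from Python is sketched below, assuming Python 3 and a made-up sample in place of code.py (PATTERN and SAMPLE are hypothetical names). The first four alternatives consume double-quoted strings, single-quoted strings, parenthesised groups, and comments without capturing anything, so only the last alternative fills group 1. Empty captures are therefore dropped.

```python
import re

# The pattern from the answer, in a raw triple-quoted string so that both
# quote characters can appear unescaped.
PATTERN = re.compile(
    r'''".*?(?<!\\)"|'.*?(?<!\\)'|\([^)(]*\)|#[^\n]*\n|[^\'\"\#(\w\n]*(\w+)[^\w]*?{'''
)

# Hypothetical sample standing in for the contents of code.py.
SAMPLE = '''\
a = {}
# z={}
z="z={}"
print('your mother = {}')
d={
   };e={
        };
'''

# Group 1 is only set when the last alternative matched.
names = [m.group(1) for m in PATTERN.finditer(SAMPLE) if m.group(1)]
print(names)
```

This is only checked against the small sample here. Nested parentheses, triple-quoted strings, and similar layouts are not handled by the pattern.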

edited Aug 15, 2015 at 4:49

answered Aug 13, 2015 at 6:49


10 Comments

Nick Adams Over a year ago

Hmm, that fixes certain problems, but I still have the problem of multiple dictionaries being assigned on the same line using ' ; '.

Nick Adams Over a year ago

OK, final problem: the only thing now is that it will still output a name even if it's in a string or a comment, that is: # z={} and z="z={}" or z="\\\\{}"

Nick Adams Over a year ago

Gah, it just doesn't agree with all of them :/ Is there some way to say 'find anything except #, " and ()', so that it only matches when none of those come before the {?

Nick Adams Over a year ago

That seems to deselect the # but still prints the z after it.

Nick Adams Over a year ago

Firstly, it doesn't work properly (the first ' makes Python think you are opening a string, and it tries to close it but can't find a match). I put \'s in front of the [^\'\"\#(\w\n] so that it would run properly. There were quite a few that weren't meant to be output; you can see them all at regex101.com/r/gP5iH5/2 (I don't want to clog up the comments section). Pretty much anything that is z or a word shouldn't be matched.


Martin Evans · Accepted Answer · 2015-08-13 17:26:20Z

A regular expression that tries to parse a Python file can usually be fooled. I would suggest the following kind of approach instead. The dis library can disassemble the byte code compiled from the source, and all of the dictionaries can then be picked out of the disassembly.

So assuming a Python source file called code.py:

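# Python 2 code (uses the StringIO module and the print statement)
import code  # the module under test, i.e. code.py in the current directory
source_module = code
source_py = "code.py"

import sys, dis, re
from contextlib import contextmanager
from StringIO import StringIO

@contextmanager
def captureStdOut(output):
    """Temporarily redirect sys.stdout to the given file-like object."""
    stdout = sys.stdout
    sys.stdout = output
    try:
        yield
    finally:
        # Restore stdout even if the body raises
        sys.stdout = stdout

with open(source_py) as f_source:
    source_code = f_source.read()

byte_code = compile(source_code, source_py, "exec")
output = StringIO()

with captureStdOut(output):
    dis.dis(byte_code)      # compiled source: module-level (global) names
    dis.dis(source_module)  # imported module: names inside functions

disassembly = output.getvalue()
# A dict-building opcode followed by a store; the stored name is inside (...)
dictionaries = re.findall(r"(?:BUILD_MAP|STORE_MAP).*?(?:STORE_FAST|STORE_NAME).*?\((.*?)\)", disassembly, re.M | re.S)

print dictionaries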

Because dis prints to stdout, the output needs to be redirected. A regular expression can then be used to spot all of the entries. I do this twice: once on the compiled source to get the globals, and once on the imported module to get the functions. There is probably a better way to do this, but it seems to work.
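On Python 3, one possible way to avoid both the stdout redirect and the import is dis.get_instructions, which yields instruction objects directly. The following is only a sketch under assumptions: SAMPLE, MAP_OPS, STORE_OPS, and dict_names are hypothetical names, and opcode names differ between CPython versions, so the two sets may need adjusting for a given interpreter.

```python
import dis
import types

# Opcode names are CPython-specific and vary by version (assumption: one of
# these builds the dict and is immediately followed by the store).
MAP_OPS = {"BUILD_MAP", "BUILD_CONST_KEY_MAP"}
STORE_OPS = {"STORE_NAME", "STORE_FAST", "STORE_GLOBAL"}


def dict_names(code_obj):
    """Collect names stored directly after a dict-building instruction."""
    names = []
    previous = None
    for instr in dis.get_instructions(code_obj):
        if previous in MAP_OPS and instr.opname in STORE_OPS:
            names.append(instr.argval)
        previous = instr.opname
    # Function bodies are nested code objects held in co_consts.
    for const in code_obj.co_consts:
        if isinstance(const, types.CodeType):
            names.extend(dict_names(const))
    return names


# Hypothetical sample standing in for the contents of code.py.
SAMPLE = 'a = {}\nz = "z={}"\ndef f():\n    aa = {}\n'
names = dict_names(compile(SAMPLE, "<sample>", "exec"))
print(names)
```

Compiling the source rather than importing it means the code under test is never executed. Very large dict literals may compile to a different instruction sequence and would be missed by this simple adjacent-pair check.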
