
I'm not sure how best to word this. I am writing a program that reads another Python file called code.py, finds all VALID dictionary variable names, and prints them. Easy enough? But the code I'm trying to scan is extremely tricky: it deliberately includes examples meant to trick the regex. The test code for code.py is here and my current code is:

    import re

    # Read the whole file to scan into one string.
    with open("code.py", "r") as myfile:
        data = myfile.read()

    # Capture the word that comes right before an opening brace.
    potato = re.findall(r' *(\w+)\W*{', data, re.M)
    for name in potato:
        print(name)

That regex doesn't work 100%. When used on the test code, it prints variables that aren't meant to be printed, from lines such as:

# z={} 
z="z={}"
print('your mother = {}')

The expected output for the test file is a0, a, b, c, d, e, etc. all the way down to z, then aa, ab, ac, ad, etc. all the way down to aq,

and anything really labeled z in the test code shouldn't print. I realise that regex isn't great for doing this, but I have to use regex, and it can be done.

EDIT: Using the new regex (r'^ *(\w+)\W*{',data,re.M), the output fails on examples where several variables are assigned on one line, such as:

d={
   };e={
        };
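
A minimal reproduction of both failure modes, as a sketch. The `sample` string is a made-up stand-in for the real test file, built only from the lines quoted above:

```python
import re

# Hypothetical stand-in for code.py, assembled from the snippets in the question.
sample = '# z={}\nz="z={}"\nd={\n   };e={\n        };\n'

# Original pattern: not anchored, so it also fires inside comments and strings.
unanchored = re.findall(r' *(\w+)\W*{', sample, re.M)

# Anchored pattern: only looks at the start of each line, so a second
# assignment after ';' on the same line is never reached.
anchored = re.findall(r'^ *(\w+)\W*{', sample, re.M)

print(unanchored)
print(anchored)
```

On this sample, the unanchored pattern reports z twice in addition to d and e, while the anchored pattern drops the z matches but also misses e.
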
  • Other than "l should print but z shouldn't" you don't explain what is the expected output. Or in other words, what is a "valid dictionary name" ? Commented Aug 13, 2015 at 6:47
  • Well, anything that would be assigned a dictionary, so if I do d = {} that would make d a dictionary in Python. But if I did d = '{}' that would make it a string, not a dictionary, and thus not valid. Commented Aug 13, 2015 at 6:54
  • Didn't you already ask this question? ...Multiple times? Commented Aug 13, 2015 at 7:01
  • I can't get an answer that I understand, and people just instantly downvote because of my poor English. I am trying to understand regex a bit better but I just can't. One of the answers told me to just not use regex, but I have to. The other question was me trying to understand how to start the problem, just asking for advice on what regex to start with. Commented Aug 13, 2015 at 7:10
  • Why do you 'have' to use a regex? Attempting python code parsing using a regex is likely to always struggle with some particular code layout or other - you need to use a proper parser, and as it happens there are some already built for you in the Python Standard Library - look under Python Language Services for the parser, ast or possibly the symtable module. Commented Aug 13, 2015 at 12:57
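
For comparison, the parser route suggested in the last comment could look roughly like the sketch below. It uses the standard `ast` module under Python 3; the function name `dict_names` and the `sample` string are illustrative assumptions, and only plain `name = {...}` literals are treated as dictionaries (not `dict()` calls or attribute/subscript targets).

```python
import ast

def dict_names(source):
    """Return names assigned directly to a dict literal, ordered by position."""
    found = []
    for node in ast.walk(ast.parse(source)):
        # Only plain assignments whose right-hand side is a {...} literal.
        if isinstance(node, ast.Assign) and isinstance(node.value, ast.Dict):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    found.append((node.lineno, node.col_offset, target.id))
    # ast.walk does not promise an order, so sort by source position.
    return [name for _, _, name in sorted(found)]

# Hypothetical sample built from the snippets quoted in the question.
sample = '# z={}\nz="z={}"\nd={\n   };e={\n        };\n'
print(dict_names(sample))
```

Because the parser already knows what is a comment and what is a string, the two z lines are never candidates. This does not satisfy a "regex only" requirement, so it is mainly useful for checking a regex's output.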

2 Answers


l should print but z shouldn't

potato = re.findall(r'^ *(\w+)\W*{',data,re.M)

This should fix it.

EDIT:

".*?(?<!\\)"|'.*?(?<!\\)'|\([^)(]*\)|#[^\n]*\n|[^\'\"\#(\w\n]*(\w+)[^\w]*?{

See demo.

https://regex101.com/r/gP5iH5/6
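
The idea behind the alternation above is to let the earlier branches swallow strings, parenthesised groups, and comments without capturing anything, so that only the last branch ever fills the capture group. A simplified sketch of that idea in Python follows; it is not the exact pattern from the demo, and `PATTERN`, `find_dict_names`, and `sample` are names made up for illustration. It assumes single-line strings and a plain `name = {` form, so triple-quoted multi-line strings would still fool it.

```python
import re

PATTERN = re.compile(r"""
    "(?:\\.|[^"\\\n])*"           # double-quoted string: consumed, not captured
  | '(?:\\.|[^'\\\n])*'           # single-quoted string: consumed, not captured
  | \#[^\n]*                      # comment up to end of line: consumed
  | \([^()]*\)                    # innermost parenthesised group: consumed
  | \b([A-Za-z_]\w*)\s*=\s*\{     # name = {   (the only capturing branch)
""", re.VERBOSE)

def find_dict_names(source):
    # Matches from the "consume" branches leave group 1 empty, so skip them.
    return [m.group(1) for m in PATTERN.finditer(source) if m.group(1)]

# Hypothetical sample built from the snippets quoted in the question.
sample = (
    '# z={}\n'
    'z="z={}"\n'
    "print('your mother = {}')\n"
    'd={\n   };e={\n        };\n'
)
print(find_dict_names(sample))
```

Using `finditer` and filtering on the group avoids the empty strings that `findall` would return for every consumed string or comment.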


10 Comments

Hmm, that fixes certain problems, but I still have the problem of assigning multiple dictionaries on the same line using ' ; '
OK, final problem: the only thing now is that it will output even if it's in a string or a comment, that is: # z={} and z="z={}" or z="\\\\{}"
Gah, it just doesn't agree with all of them :/ Is there some way to say 'Find anything except #, " and ()' so that it will match anything that doesn't have those before the {
That seems to deselect the # but still prints the z after it.
Firstly, it doesn't work properly (the first ' makes Python think you are opening a string, and it tries to close it but can't find a match). I put \'s in front of the quotes in [^\'\"\#(\w\n] so that it would run properly. There were quite a few matches that weren't meant to be output; you can see them all at regex101.com/r/gP5iH5/2 (I don't want to clog up this section). Pretty much anything that is z or a word shouldn't be matched.

Parsing a Python file with a regular expression can usually be fooled by some input. I would suggest the following kind of approach instead. The dis library can disassemble the byte code compiled from the source, and all of the dictionaries can then be picked out of the disassembly.

So assuming a Python source file called code.py:

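# Note: this is Python 2 code (StringIO module, print statement).
import code  # imports code.py (assumes it shadows the standard library's code module on sys.path)
source_module = code
source_py = "code.py"

import sys, dis, re
from contextlib import contextmanager
from StringIO import StringIO

@contextmanager
def captureStdOut(output):
    # Temporarily send everything printed to stdout into `output`.
    stdout = sys.stdout
    sys.stdout = output
    try:
        yield
    finally:
        # Restore stdout even if the body raises.
        sys.stdout = stdout

with open(source_py) as f_source:
    source_code = f_source.read()

byte_code = compile(source_code, source_py, "exec")
output = StringIO()

with captureStdOut(output):
    dis.dis(byte_code)      # compiled source: module-level (global) assignments
    dis.dis(source_module)  # imported module: functions defined in it

disassembly = output.getvalue()
# A dict build followed by a store; the stored name is printed in parentheses.
dictionaries = re.findall(r"(?:BUILD_MAP|STORE_MAP).*?(?:STORE_FAST|STORE_NAME).*?\((.*?)\)", disassembly, re.M | re.S)

print dictionaries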

As dis prints to stdout, you need to redirect the output. A regular expression can then be used to spot all of the entries. I do this twice, once by compiling the source to get the globals and once by importing the module to get the functions. There is probably a better way to do this but it seems to work.
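
On Python 3.4 or later, the same idea can be tried without redirecting stdout, because `dis.get_instructions` returns the instructions as objects. The sketch below is an assumption-laden variant, not the code above: `dict_names_from_bytecode` is a made-up name, it only looks at module-level code, and it only recognises a store that immediately follows a dict-building opcode. Opcode names differ between Python versions, so the two sets are a best guess.

```python
import dis

# Opcodes that leave a freshly built dict on the stack (names vary by version).
MAP_OPS = {"BUILD_MAP", "BUILD_CONST_KEY_MAP"}
# Opcodes that bind the value on top of the stack to a name.
STORE_OPS = {"STORE_NAME", "STORE_GLOBAL", "STORE_FAST"}

def dict_names_from_bytecode(source, filename="<source>"):
    """Names stored immediately after a dict is built, module level only."""
    byte_code = compile(source, filename, "exec")
    names = []
    previous = None
    for instr in dis.get_instructions(byte_code):
        if (previous is not None
                and previous.opname in MAP_OPS
                and instr.opname in STORE_OPS):
            names.append(instr.argval)  # argval is the target name
        previous = instr
    return names

print(dict_names_from_bytecode('d = {}\nz = "z={}"\ne = {"k": 1}\n'))
```

Very large dict literals or chained assignments may compile to other instruction sequences, which this simple "previous instruction" check would miss.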
