Can compile function output be used to detect changes in the source code

Question

Python's builtin function compile()

code = """def fact(x):
    if x <= 1:
        return 1
    print (1)
    return x*fact(x-1)
fact(10)"""
c = compile(code, "<string>", "exec")

code = """def fact(y):
    if y <= 1:
        return 1
    print (1)
    return y*fact(y-1)
fact(10)"""
d = compile(code, "<string>", "exec")

here c == d is False. Which is the expected behavior? (adding space in source code or making print 1 instead of print(1) doesn't result in a changed object, which is correct.)

My question is: Can compile object be used to detect changes in the Python source code?

Edit:

To explain it more clearly, I am working on a web application that will allow users to execute their Python code.

Every time a user executes code, its executed on the server. Even adding a space/bracket etc results in a new execution. I am trying to optimize this step by storing the compiled code and if the code is the same in the new request then not executing it.

How do I know that code has changed? (Is it the kind of change that requires executing the code again or a simple space addition.)

I do think there are other ways to achieve what I am trying to achieve by using hashed value or something like this but I want to do it this way because it seems more Pythonic and better than reinventing the wheel.

@QuentinPradet Thanks. I would prefer to use a builtin function rather a 3rd party module. Also my question is different. — Amit Tripathi
– Amit Tripathi, Commented Jun 24, 2016 at 7:39
Since meta can get the variable names back, it has to be present in the compile output. I guess the answer is yes. But if you only care about ast changes, why don't you simply use ast.parse ? Using the AST pretty-printer from greentreesnakes.readthedocs.io/en/latest/index.html — Quentin Pradet
– Quentin Pradet, Commented Jun 24, 2016 at 7:44
Actually I am not concerned about about AST changes. I want to know can I detect changes in source code using compile. — Amit Tripathi
– Amit Tripathi, Commented Jun 24, 2016 at 7:55
It sounds like you have something very specific in mind; Why compile? It would seem this type of problem is solved in a code repository. So what is the problem you're trying to solve here? Or is this an academic question? — Marcel Wilson
– Marcel Wilson, Commented Jun 24, 2016 at 16:25

Dimitris Fasarakis Hilliard · Accepted Answer · 2016-06-27 20:37:41Z

Don't do this, as an optimization it won't result in great speed ups.

compile is a built-in function, it is implemented in C, it is fast and is not a place where you should be looking to optimize. When a user tries to execute code you should allow it to be compiled and executed without any sort of caching.

Consider the following, writing code that tries to discover if there's any difference in text entered by a user, will most likely result in more execution time than actually just compiling the function and executing it. Apart from this, you also have the fact that you'll have to store and retrieve the compile code somewhere. The first adds unnecessary space requirements and, the second, adds to the overhead because of the look ups you have to do (depending on how you chose to store it, of course).

Apart from what I just said, you can try and compare the result of compiling Python code but the approach is limited and obfuscated. I will assume the code snippets you entered should compare equal since the only difference between them lies in the parameter names (whose name has no impact on code execution).

Use the byte code:

In general the approach I'm going to present only returns "equal" (True) if the code snippets in question only differ in white-space and/or parameter names, in all other cases, it returns False (you could attempt to make it more intelligent but that would probably require you taking many edge cases into consideration).

What you should generally be aware of is that compile returns a code object. code objects generally contain all the information Python needs in order to execute a piece of code. You can view the instructions contained in a code object by using the dis module:

from dis import dis
dis(c)
  1           0 LOAD_CONST               0 (<code object fact at 0x7fa7bc30e630, file "<string>", line 1>)
              3 MAKE_FUNCTION            0
              6 STORE_NAME               0 (fact)

  6           9 LOAD_NAME                0 (fact)
             12 LOAD_CONST               1 (10)
             15 CALL_FUNCTION            1
             18 POP_TOP             
             19 LOAD_CONST               2 (None)
             22 RETURN_VALUE

This is what your code snippet does. It loads another code object (which represents the function fact), makes a function call and returns.

The same exact output is generated when you call dis(d), the only difference lies in the loading of the code object for fact:

dis(d)
  1           0 LOAD_CONST               0 (<code object fact at 0x7fa7bc30e5b0, file "<string>", line 1>)

As you can see, they are different:

code object fact at 0x7fa7bc30e630 != code object fact at 0x7fa7bc30e5b0

but their difference does not lie in their functionality, rather they are different instances representing a unit of functionality that is identical.

So here lies the question: do we care about the fact that they are different instances if they represent the same functionality? Similarly, do we care if two variables have different names if they represent the same value? I'm assuming we don't and this is why I think you can make a meaningful basic comparison of code objects if you compare their raw byte code.

Comparing the raw byte code:

Raw byte code is pretty much what I previously described, no names, no identities just a sequence of commands for the Python interpreter (pure functionality). Let's take a look at it:

c.co_code
'd\x00\x00\x84\x00\x00Z\x00\x00e\x00\x00d\x01\x00\x83\x01\x00\x01d\x02\x00S'

Okay, that's ugly, let's have a better look:

dis(c.co_code)
  0 LOAD_CONST          0 (0)
  3 MAKE_FUNCTION       0
  6 STORE_NAME          0 (0)
  9 LOAD_NAME           0 (0)
 12 LOAD_CONST          1 (1)
 15 CALL_FUNCTION       1
 18 POP_TOP        
 19 LOAD_CONST          2 (2)
 22 RETURN_VALUE

That looks better. This is exactly the same as the previous dis(c) command, the only difference is that the names are not present since they don't really play a part in all this.

So, what do we get if we compare d.co_code with c.co_code? Of course, equality since the commands executed are identical. But, there's a pit-fall here, in order to be 100% sure that d.co_code is equal to c.co_code we also need to compare the code objects for the functions loaded inside c and d (the code objects representing the function fact) if we don't compare these, we'll be getting false positives.

So where does the code object for the functions fact lie in each case? They lie in a field called co_consts inside the code object c and d respectively, co_consts is a list containing all the constants for a specific code object. If you take a peak inside there, you'll be able to see the definition for each of these:

# located at position 0 in this case.
# the byte code for function 'fact'
dis(c.co_consts[0])  
  2           0 LOAD_FAST                0 (x)
              3 LOAD_CONST               1 (1)
              6 COMPARE_OP               1 (<=)
              9 POP_JUMP_IF_FALSE       16

  3          12 LOAD_CONST               1 (1)
             15 RETURN_VALUE        

  4     >>   16 LOAD_CONST               1 (1)
             19 PRINT_ITEM          
             20 PRINT_NEWLINE       

  5          21 LOAD_FAST                0 (x)
             24 LOAD_GLOBAL              0 (fact)
             27 LOAD_FAST                0 (x)
             30 LOAD_CONST               1 (1)
             33 BINARY_SUBTRACT     
             34 CALL_FUNCTION            1
             37 BINARY_MULTIPLY     
             38 RETURN_VALUE

So, what do we do? We compare their raw byte code to see if they represent the same functionality, as we did previously.

As you can understand this turns out to be a recursive procedure where we first compare the raw byte code for the input, then scan through co_consts to see if another code object is present and repeat until no code objects can be found, if the code objects are in a different position in co_consts we'll return False.

In code, that should look like this:

from types import CodeType 

def equal_code(c1, c2):
    if c1.co_code != c2.co_code:
        return False
    for i, j in zip(c1.co_consts, c2.co_consts):
        if isinstance(i, CodeType):
            if isinstance(j, CodeType):
                return equal_code(i, j)
            else: # consts in different position, return false
                return False
    return True

Where CodeType from types is used to check for code instances.

I think this is pretty much the best you can do using only code objects generated from compile.

Thanks @Jim for the detailed answer. I have 3 questions. 1. You said will most likely result in more execution time than actually just compiling the function and executing it isn't Python internally first compiles the code and then execute? 2. Should I use equal_code approach or not because it will still require saving compiled code somewhere? 3. Where do I start learning about Python internals?
@De̲̳̳ath-Stalke̲̳̳r 1) In an interactive session (REPL) Python always compiles input and then executes. When running a module, Python might skip the compilation step if nothing has changed and just uses the .pyc files from previous executions. In both cases, the execution is costly, not the compilation. 2) I would just execute the code without performing any sort of caching I just wanted to show you how you can compare 2 code objects in a meaningful way. If you did the c1 parameter of equal_code would be the saved compiled code.
3) The source is always a good reference. After this, another good resourse is the series Python Innards and probably any article you can read from Eli Bendersky's blog, he also has a number of articles on Python innards.

Burhan Khalid · Accepted Answer · 2016-06-27 10:18:57Z

2

There are many ways to do this that are a lot simpler than what you are trying:

Use diff, with the -w (ignore whitespace) flag. If there is a resulting difference, you will know it is highly likely that it is a code change.
Use git or other source code repository. Then find out if there is a change done between files before deciding to execute. However, in this respect you are just using the diff-ing capabilities of git, so might as well go with the first option.

answered Jun 27, 2016 at 10:18

Burhan Khalid

175k20 gold badges254 silver badges291 bronze badges

Comments

Useless · Accepted Answer · 2016-06-27 10:46:15Z

So many questions...

Can compile object be used to detect changes in the Python source code?

Yes, sort of. However, since code objects contain the names of their local variables, renaming x to y will make your comparison fail even though there's no functional change.

How do I know that code has changed? (Is it the kind of change that requires executing the code again or a simple space addition.)

It's worth mentioning that a "simple space addition" may require re-compilation and re-execution in Python, more so than in many other languages.

I do think their are other way to achieve what I a trying to achieve by using hashed value or something like this but I want to do it this way because it seems more Pythonic and better than reinventing the wheel.

I'm not sure what's Pythonic about this particular option - maybe it makes your code a lot simpler? That would be a perfectly good reason to choose it.

Otherwise, string comparison is probably faster (with the same sensitivity to variable renaming), and a full AST comparison is more complex but potentially smarter.

Finally:

Every time an user executes his/her code its executed on the server. Even adding an space/bracket etc results into a new execution. I am trying to optimize this step by storing the compiled code and if the code is same in new request then not executing it.

But you probably should execute the user's code when they explicitly ask you to.

If you were doing this every time they typed a character, it would obviously pointless, but consider code using random numbers: the user would reasonably expect to see the output change when they hit execute, even with no code change.

Collectives™ on Stack Overflow

Can compile function output be used to detect changes in the source code

3 Answers 3

Use the byte code:

Comparing the raw byte code:

3 Comments

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Use the byte code:

Comparing the raw byte code:

3 Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related