Code matching algorithm

Question

I have two java source code files that do the same thing, give the same output. They differ a little in code they contain like in the example below. I need an algorithm that derteminates the rate of likeness (sameness) between these two java code files.

Example

/* First file */
public int inc (int n) {
    return ++n;
}

/* Second file */
public int inc (int n) {
    return (n+1);
}

Is there an algorithm that shows that these two files do the same thing ?

Thanks in advance

you can check the output always to see if its the same or not. There can be many ways same thing can be achieved. It depends on how to actually implement it. You should make sure that you want the similarity in Way or the Destination/Result — AnkeyNigam
– AnkeyNigam, Commented Jul 15, 2015 at 17:56
In the general case, determining whether two programs do the same thing (for all inputs) is an undecidable problem (I believe). You may be able to do better in specific cases, though. For example, if you can show that your code has no side-effects (i.e. is pure), then you could brute-force over all input values and compare. — Oliver Charlesworth
– Oliver Charlesworth, Commented Jul 15, 2015 at 17:56
Depends on what your programs do. If they generate some text output (or even number as a result) you can always compare results. Otherwise - just compare the code with BeyondCompare for example. — Artur
– Artur, Commented Jul 15, 2015 at 17:59
Yes but how do i understand that i have not just printed the result. — Erion Lika
– Erion Lika, Commented Jul 15, 2015 at 17:59

Marko Topolnik · Accepted Answer · 2015-07-15 18:01:59Z

4

As proved by Alan Turing almost a century ago, no general algorithm exists which could even determine whether a function's evaluation will complete in finite time (see the Halting problem).

By implication there is no general algorithm which could decide whether two pieces of code have the same output. On the other hand, if you assume that the functions always complete in finite time, then a trivial algorithm that does what you want is one which simply runs the code in question for all possible inputs.

answered Jul 15, 2015 at 18:01

Marko Topolnik

201k31 gold badges336 silver badges455 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Oliver Charlesworth Over a year ago

That's assuming the set of "all possible inputs" is finite. Consider strings/input-streams/network-connections as a counterexample.

Marko Topolnik Over a year ago

@OliverCharlesworth We don't even need to go that far---it stops being feasible just around a length of 20 bytes or so :)

Sumeet · Accepted Answer · 2015-07-15 18:15:03Z

Ignoring all the intricate details, Here is a naive school level algorithm to do this.

Test 1:First count the number of variables used in both programs.See the difference between them and decide a threshold difference for passing the test depending upon your needs and the programs you are comparing.

Test 2:Then determine the data type of variable that is used maximum times in both programs and if the data type differs then in most cases the programs will differ but again, It is not always the case.

Test 3:You could compare their cyclomatic complexity. This could also help in determining the likeness as it tell you about the number of independent paths in your programs.

There can be many more tests like comparing the number of blocks or function calls and you could set the rate of likeness to be equal to percentage of tests passed.

But of course This algorithm is naive and will have test cases for which it will fail, but for the basics and a start I think it should do fine.

GregH · Accepted Answer · 2015-07-15 18:02:13Z

0

No such algorithm exists unfortunately.

Why? See: https://www.cs.rochester.edu/~nelson/courses/csc_173/computability/undecidable.html

specifically the section labeled "The Equivalence Problem is Undecidable"

To truly understand why it would help to familiarize yourself in such areas as decidability, turing machines, and various other types of automata

answered Jul 15, 2015 at 18:02

GregH

5,53112 gold badges62 silver badges115 bronze badges

Comments

user3437460 · Accepted Answer · 2015-07-15 19:02:57Z

If you are looking forward to write a program which detects similar work, you can try searching how software Turnitin works.

While it is not possible to come out with an algorithm which works for all cases. There is some work around solutions you can adopt.

You can create "synonyms" for certain words/code. Example x++ similar to x+=1, x=x+1, ++x and so on. However note that sometimes x++ and ++x mean totally different things in coding. So your program can never be 100% accurate to judge the similarity.
Your database of "synonyms" must be large enough to handle various cases.

However, take note that people can always bypass your detection by obfuscating their codes or adding confusion to it.

Example: instead of writing x++, they could write x=(x+2)-1 which means the same thing and no matter how large your database of synonyms are, you can never catch that.

If you program does not only check on codes, but on essays and journals as well. You can make a text analysis on writing patters. (For example using word length frequency count). Some authors may prefer to use shorter words, while some prefer longer words.
There are many other analysis you can come out with to make your program a dynamic one. Even though there is no such algorithm, but it is definitely possible to write such programs to detect plagiarism.

Collectives™ on Stack Overflow

Code matching algorithm

4 Answers 4

2 Comments

Comments

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

2 Comments

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related