3

I need a module or strategy for detecting whether a piece of data is written in a programming language. I don't mean syntax highlighting, where the user explicitly chooses the syntax to highlight. My question has two levels, and I would greatly appreciate any help:

  1. Is there a Python package that takes a string (a piece of data) and reports whether it matches the syntax of any programming language?
  2. I don't necessarily need to identify the language, only to know whether the string is source code at all.

Any clues are deeply appreciated.


3 Answers

3
+50

Maybe you can use an existing multi-language syntax highlighter. Many of them can detect the language a file is written in.


4 Comments

Could you please post an example or a package that does this? All the ones I saw require you to specify the highlighting language.
Depending on efficiency requirements, you could just loop through all the supported languages and see if any of them parse.
@PepperoniPizza My apologies. I found that many packages actually detect the language by file extension. Anyway, I found a JS implementation of code-language relevance.
@jokester, cool, that's more or less what I was looking for. It's a shame it's not written in Python.
3

You could have a look at methods based on Bayesian filtering.
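
As a rough illustration of the Bayesian idea, here is a minimal sketch of a naive Bayes classifier that labels a string as "code" or "text". Everything in it (the NaiveBayes class, the tokenize helper, the six tiny training samples) is made up for this example; a usable filter would need a far larger training corpus.

```python
import math
import re
from collections import Counter

# Words, plus each punctuation character as its own token (digits are ignored).
TOKEN_RE = re.compile(r"[A-Za-z_]+|[^\sA-Za-z_0-9]")


def tokenize(text):
    """Split text into lowercase words and single punctuation characters."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.token_counts = {}       # label -> Counter of tokens seen
        self.doc_counts = Counter()  # label -> number of training samples
        self.vocab = set()

    def train(self, label, text):
        tokens = tokenize(text)
        self.token_counts.setdefault(label, Counter()).update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def classify(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.token_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for this sketch.
clf = NaiveBayes()
clf.train("code", "def add(a, b):\n    return a + b")
clf.train("code", "for (int i = 0; i < n; i++) { total += i; }")
clf.train("code", "if (x == null) { return; }")
clf.train("text", "Thanks for sharing, I will consider this also.")
clf.train("text", "Any clues are deeply appreciated.")
clf.train("text", "The weather was nice so we went for a walk.")

print(clf.classify("while (i < 10) { i++; }"))        # code
print(clf.classify("I will consider this, thanks."))  # text
```

Treating punctuation as tokens is a deliberate choice here: brackets, semicolons and operators are much more frequent in source code than in prose, so they carry most of the signal.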


2

My answer somewhat depends on the amount of code you're going to be given. If you're going to be given 30+ lines of code, it should be fairly easy to identify features of each language that are both distinctive and common. For example, tell the program that anything matching an expression like from * import * is Python (I'm not 100% sure that phrasing is unique to Python, but you get the gist). Other things that usually differ between languages are definitions (a Python class definition always starts with 'class', while a C function definition starts with its return type, so you could check for a line that starts with a data type and is formatted like a function declaration), the formatting of conditionals, and so on.

If you wanted to make it more accurate, you could introduce a weighting system: features that are more distinctive and less likely to be the result of a mismatched regexp get a higher weight, features that are commonly mismatched get a lower weight, and at the end you pick the language with the highest composite score. You could also define features that you consider 100% unique and stop parsing as soon as one of them is hit, because the answer is already known (things like the shebang line).

This would, of course, involve you knowing enough about the languages you want to identify to find unique features to look for, or being able to find people that do know unique structures that would help.
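
The weighted-feature idea could be sketched roughly as follows. The patterns, weights, threshold, and the names FEATURES, SHEBANGS and guess_language are all assumptions picked for illustration, and only two languages are covered; a real table would need tuning against actual samples.

```python
import re

# Hypothetical feature table: language -> list of (pattern, weight).
# Distinctive features get high weights, easily mismatched ones get low weights.
FEATURES = {
    "python": [
        (re.compile(r"^\s*from\s+[\w.]+\s+import\s+", re.M), 3.0),
        (re.compile(r"^\s*def\s+\w+\s*\(.*\)\s*:", re.M), 3.0),
        (re.compile(r"^\s*class\s+\w+.*:\s*$", re.M), 2.0),
    ],
    "c": [
        (re.compile(r"^\s*#include\s*[<\"]", re.M), 3.0),
        # A line starting with a data type, shaped like a function declaration.
        (re.compile(r"^\s*(?:int|void|char|float|double)\s+\w+\s*\([^)]*\)", re.M), 2.0),
        # A trailing semicolon is common to many languages, so it counts for little.
        (re.compile(r";\s*$", re.M), 0.5),
    ],
}

# Features treated as decisive: the first hit ends the search.
SHEBANGS = [(re.compile(r"\A#!.*\bpython"), "python")]


def guess_language(source, threshold=2.0):
    """Return the best-scoring language, or None if nothing scores high enough."""
    for pattern, lang in SHEBANGS:
        if pattern.search(source):
            return lang
    scores = {
        lang: sum(weight * len(pattern.findall(source)) for pattern, weight in feats)
        for lang, feats in FEATURES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None


print(guess_language("from os import path\n\ndef main():\n    print(path.sep)\n"))  # python
print(guess_language("#include <stdio.h>\nint main(void) {\n    return 0;\n}\n"))   # c
print(guess_language("Thanks for sharing, I will consider this also."))              # None
```

Returning None below the threshold also gives a crude answer to the "is this source code at all" part of the question.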

If you're given fewer than 30 or so lines of code, the answers from parsing like that are going to be far less accurate. In that case the easiest way would probably be to take something similar to Travis and just run the code in each language (in a VM, of course). If the code runs successfully in a language, you have your answer. If not, you would need a list of errors that are "acceptable" (as in, they are errors in the way the code was written, not in the interpreter). It's not a great solution, but at some point your code sample will just be too short to give an accurate answer.
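
A lighter variant of the same idea is to only parse the sample with each language's parser instead of executing it, which avoids the need for a VM. This is a swapped-in technique, not the run-it-everywhere approach described above. The sketch below uses just the two parsers that ship with Python's standard library; JSON is a data format rather than a programming language and serves only as a stand-in for a second parser. The names PARSERS and languages_that_parse are hypothetical.

```python
import ast
import json


def parses_as_python(source):
    """True if the text is syntactically valid Python (nothing is executed)."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    return True


def parses_as_json(source):
    """True if the text is a valid JSON document."""
    try:
        json.loads(source)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


# Hypothetical registry; other languages would need their own parsers
# or external tools.
PARSERS = {"python": parses_as_python, "json": parses_as_json}


def languages_that_parse(source):
    """Return the names of all registered parsers that accept the text."""
    return [name for name, check in PARSERS.items() if check(source)]


print(languages_that_parse("import os\nprint(os.getcwd())\n"))  # ['python']
print(languages_that_parse('{"a": [1, 2, 3]}'))                 # ['python', 'json']
print(languages_that_parse("Thanks for sharing, I will consider this also."))  # []
```

The second result shows the short-sample problem: a small JSON document is also a valid Python expression, and so is a single bare word, so a successful parse is weak evidence on its own.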

1 Comment

Thanks for sharing, I will consider this also.
