Syntax recognizer in python

Question

I need a module or strategy for detecting whether a piece of data is written in a programming language. I don't mean syntax highlighting, where the user explicitly chooses which syntax to highlight. My question has two levels, and I would greatly appreciate any help:

Is there any package in Python that takes a string (a piece of data) and reports whether it matches the syntax of any programming language?
I don't necessarily need to identify which language it is, only whether the string is source code at all.

Any clues are deeply appreciated.

What is the scope of your project? How many languages do you need it to detect? Are false positives or false negatives more important to minimize? If you don't care what kind of language you detect, programmers.stackexchange.com/questions/87611/…
– Patashu, Commented May 7, 2013 at 4:49
The project is medium-sized and will be used to filter harvested sources, so false negatives are not a worry, but false positives are important to avoid. As for languages, I guess as many as possible.
– PepperoniPizza, Commented May 7, 2013 at 4:50
Dupe of stackoverflow.com/questions/475033/… ? At the very least, linguist looks like pretty much what you're looking for. (Or as close as you're likely to find.)
– Lucas Wiman, Commented May 9, 2013 at 5:09
This SO question probably has the answer you're looking for: stackoverflow.com/questions/325165/…
– elssar, Commented May 9, 2013 at 5:36
Does this answer your question? Is there a library that will detect the source code language of a block of code?
– MatthewMartin, Commented Jan 17, 2021 at 17:44

Jokester · Accepted Answer · 2013-05-09 04:54:00Z

Maybe you can use existing multi-language syntax highlighters. Many of them can detect the language a file is written in.

4 Comments

PepperoniPizza Over a year ago

Could you please post an example or a package that does this? All the ones I saw require you to specify the highlighting language.

Lucas Wiman Over a year ago

Depending on efficiency requirements, you could just loop through all the supported languages and see if any of them parse.

Jokester Over a year ago

@PepperoniPizza My apologies. I found that many packages actually detect the language by file extension. Anyway, I found a JS implementation of code-language relevance.

PepperoniPizza Over a year ago

@jokester, cool, that's more or less what I was looking for. It's a shame it's not written in Python.
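
As a rough illustration of the "see if any of them parse" idea from the comments above, here is a minimal Python sketch that uses only standard-library parsers. The `PARSERS` table and the `languages_that_parse` helper are hypothetical names made up for this example, and only two formats are covered. Keep in mind that very short inputs are ambiguous: a single word is a valid Python expression, so on its own this check can produce exactly the false positives the asker wants to avoid.

```python
from __future__ import annotations

import ast
import json


def _parses_as_python(text: str) -> bool:
    """Return True if the text is syntactically valid Python."""
    try:
        ast.parse(text)
        return True
    except (SyntaxError, ValueError):
        # ValueError covers inputs such as strings with null bytes
        # on some Python versions.
        return False


def _parses_as_json(text: str) -> bool:
    """Return True if the text is a valid JSON document."""
    try:
        json.loads(text)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


# Hypothetical registry: language name -> "does it parse?" check.
# A real tool would need one parser per language it wants to support.
PARSERS = {
    "python": _parses_as_python,
    "json": _parses_as_json,
}


def languages_that_parse(text: str) -> list[str]:
    """Loop over every registered parser and collect the ones that accept the text."""
    return [name for name, parses in PARSERS.items() if parses(text)]


print(languages_that_parse("def f(x):\n    return x + 1\n"))  # ['python']
print(languages_that_parse("This is an ordinary English sentence."))  # []
```

A JSON object such as `{"a": [1, 2]}` is accepted by both checks, because it is also a valid Python dict literal, so a match tells you the text could be source code, not which language it is.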

kiriloff · Accepted Answer · 2013-05-07 04:53:14Z

You could have a look at methods based on Bayesian filtering.
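
One way to read the Bayesian filtering suggestion is a naive Bayes classifier over tokens, the same technique used in classic spam filters. The sketch below is a toy under stated assumptions: the training snippets, the `tokenize` function, the `NaiveBayes` class and `build_demo_classifier` are all invented for illustration, and a real filter would need a much larger labelled corpus.

```python
import math
import re
from collections import Counter

# Words (letters/underscores) or single punctuation characters; digits are ignored.
TOKEN_RE = re.compile(r"[A-Za-z_]+|[^\sA-Za-z_0-9]")


def tokenize(text):
    return TOKEN_RE.findall(text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.token_counts = {}       # label -> Counter of token frequencies
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()

    def train(self, label, text):
        tokens = tokenize(text)
        self.token_counts.setdefault(label, Counter()).update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def log_scores(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.token_counts.items():
            total_tokens = sum(counts.values())
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.doc_counts[label] / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / (total_tokens + len(self.vocab)))
            scores[label] = score
        return scores

    def classify(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


def build_demo_classifier():
    """Train on a tiny, made-up corpus (assumption: purely for demonstration)."""
    nb = NaiveBayes()
    for sample in [
        "def add(a, b):\n    return a + b",
        "for (int i = 0; i < n; i++) { total += values[i]; }",
        "import os\nprint(os.getcwd())",
        "if (x == null) { return false; }",
    ]:
        nb.train("code", sample)
    for sample in [
        "The weather was lovely, so we walked to the market.",
        "Please send me the report before the meeting on Friday.",
        "She said that the train would arrive late again.",
        "I think this is a good idea, but we need more time.",
    ]:
        nb.train("prose", sample)
    return nb


nb = build_demo_classifier()
print(nb.classify("while (count < 10) { count = count + 1; }"))  # code
print(nb.classify("We should meet for lunch at the cafe tomorrow."))  # prose
```

Since false positives matter more than false negatives here, one option is to compare the two values from `log_scores` and only accept "code" when it wins by a wide margin, rather than taking the plain maximum.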

Seth Curry · Accepted Answer · 2013-05-10 14:44:52Z

My answer somewhat depends on the amount of code you're going to be given. If you're going to be given 30+ lines of code, it should be fairly easy to identify some features of each language that are both distinctive and fairly common. For example, tell the program that if anything matches an expression like from * import * then it's Python (I'm not 100% sure that phrasing is unique to Python, but you get the gist). Other things that usually differ between languages are definitions (a Python class definition always starts with 'class', while a C function definition starts with the return type, so you could check for a line that starts with a data type and is formatted like a function declaration), conditionals, which are usually formatted slightly differently, and so on.

If you wanted to make it more accurate, you could introduce some sort of weighting system: features that are more distinctive and less likely to be the result of a mismatched regexp get a higher weight, features that are commonly mismatched get a lower weight, and at the end you calculate which language has the highest composite score. You could also define features that you feel are 100% unique and tell the program to stop parsing as soon as it hits one of those, because it already knows the answer (things like the shebang line).

This would, of course, involve you knowing enough about the languages you want to identify to find unique features to look for, or being able to find people that do know unique structures that would help.
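
The weighted-feature idea described above could be sketched roughly as follows. Everything here is an assumption made for illustration: the regular expressions, the weights, the `threshold` value and the names `FEATURES`, `SHEBANGS`, `score_languages` and `detect` are not taken from any existing package, and only two languages are covered.

```python
import re

# Hypothetical (pattern, weight) pairs per language. More distinctive
# features get a higher weight; easily mismatched ones get a lower weight.
FEATURES = {
    "python": [
        (re.compile(r"^[ \t]*from[ \t]+[\w.]+[ \t]+import[ \t]+\S+", re.M), 3.0),
        (re.compile(r"^[ \t]*def[ \t]+\w+[ \t]*\(.*\)[ \t]*:", re.M), 3.0),
        (re.compile(r"^[ \t]*class[ \t]+\w+.*:[ \t]*$", re.M), 2.0),
        (re.compile(r"^[ \t]*(?:if|elif|while|for)\b.*:[ \t]*$", re.M), 1.0),
    ],
    "c": [
        (re.compile(r"^[ \t]*#[ \t]*include[ \t]*[<\"]", re.M), 3.0),
        # A line that starts with a data type and looks like a function header.
        (re.compile(
            r"^[ \t]*(?:int|void|char|float|double|long)[ \t]+\**\w+[ \t]*"
            r"\([^;\n]*\)[ \t]*\{?[ \t]*$", re.M), 2.0),
        (re.compile(r";[ \t]*$", re.M), 0.5),
    ],
}

# Features treated as decisive: a match ends the search immediately.
SHEBANGS = {
    "python": re.compile(r"^#!.*\bpython"),
    "shell": re.compile(r"^#!.*\b(?:ba|z)?sh\b"),
}


def score_languages(text):
    """Return the composite score per language: weight times number of matches."""
    return {
        lang: sum(weight * len(pattern.findall(text)) for pattern, weight in features)
        for lang, features in FEATURES.items()
    }


def detect(text, threshold=3.0):
    """Return the best-scoring language, or None if nothing scores high enough."""
    first_line = text.split("\n", 1)[0]
    for lang, pattern in SHEBANGS.items():
        if pattern.search(first_line):
            return lang  # short-circuit on a shebang line
    scores = score_languages(text)
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None


print(detect("from os import path\n\ndef main():\n    pass\n"))  # python
print(detect("I walked from the station to import some tea; it was lovely."))  # None
```

Raising `threshold` trades missed detections for fewer false positives, which matches the priority stated in the question's comments.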

If you're given fewer than 30 or so lines of code, the answers from that kind of pattern matching are going to be far less accurate. In that case the easiest way to do it would probably be to set up something similar to Travis and just run the code in each language (in a VM, of course). If the code runs successfully in a language, you have your answer. If not, you would need a list of errors that are "acceptable" (as in, they are errors in the way the code was written, not in the interpreter). It's not a great solution, but at some point your code sample will just be too short to give an accurate answer.
