Detecting Source Code Language From Code Snippets

Nishant Arora 15/Dec/2013

This is an interesting project. The scenario goes something like this:

I have a set of files, a lot of them, each containing some source code. I do not know which language they are written in, and the files have no extensions. They are basically snippets and can be in any programming language. I need to catalog these files by the language they are written in.

How can you possibly achieve this? The first solution that came to my mind was to write regular expressions for each language. This is partly an answer, but a tedious one. And even if I wrote all the regular expressions myself, some cases would be really tricky (try C/C++/C#).
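
To make the regular-expression idea concrete, here is a minimal sketch of what such a detector might look like. It is not the approach I ended up using: the `HINTS` table and the `guessLanguage` name are made up for illustration, and the patterns are nowhere near complete.

```javascript
// Hypothetical per-language hint patterns (an assumption for illustration,
// far from a complete rule set).
const HINTS = {
  python: [/^\s*def\s+\w+\s*\(.*\)\s*:/m, /^\s*import\s+\w+/m, /\bprint\s*\(/],
  c: [/#include\s*<\w+\.h>/, /\bprintf\s*\(/, /\bint\s+main\s*\(/],
  cpp: [/#include\s*<iostream>/, /\bstd::/, /\bint\s+main\s*\(/],
  csharp: [/\busing\s+System\b/, /\bnamespace\s+\w+/, /\bConsole\.WriteLine\b/],
};

// Count how many hint patterns match for each language and keep the best.
// On a tie, the language listed first in HINTS wins.
function guessLanguage(source) {
  let best = { language: null, score: 0 };
  for (const [language, patterns] of Object.entries(HINTS)) {
    const score = patterns.filter((re) => re.test(source)).length;
    if (score > best.score) best = { language, score };
  }
  return best;
}
```

A snippet such as `int main() { return 0; }` matches one C pattern and one C++ pattern equally, so the result depends only on the order of the table. That is exactly the kind of C/C++/C# ambiguity that makes hand-written rules painful.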

There are multiple questions of this kind on Stack Overflow, all marked as duplicates of this one: http://stackoverflow.com/questions/475033/detecting-programming-language-from-a-snippet. It turned out to be a pretty badass problem that many people were struggling with. But no one had posted how to actually achieve this, especially with a CLI or an API that could solve it.

Then I realized that several code highlighting plugins are capable of detecting the language of the source code they are given. And voila! This was something I could use. After going through multiple code highlighting plugins, I came across highlight.js, written by Ivan Sagalaev. It is one of the best and most comprehensive code highlighting plugins available. The code is well written and has many contributors.

Now I could easily use this to read snippets and recognize languages. The only issue: it is written in JavaScript and has a decent browser implementation, but I needed a CLI solution to build an API over it. Node.js sounded like a perfect candidate for this problem!
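
The idea can be sketched roughly as follows. This is not the actual script, just a minimal illustration assuming a recent highlight.js release, where `highlightAuto` returns an object with `language` and `relevance` fields. The helper names `detectLanguage` and `detectFile` are hypothetical, and the highlighter is passed in as an argument so the sketch does not depend on how the library is installed.

```javascript
const fs = require('fs');

// `highlighter` is expected to be the highlight.js module,
// e.g. the value of require('highlight.js') (assumed to be installed).
// detectLanguage is a hypothetical helper, not part of highlight.js.
function detectLanguage(highlighter, source) {
  const result = highlighter.highlightAuto(source);
  return {
    // highlightAuto may not settle on any language for odd input.
    language: result.language || 'unknown',
    relevance: result.relevance,
  };
}

// Read a snippet from disk and run the detection on its contents.
function detectFile(highlighter, path) {
  return detectLanguage(highlighter, fs.readFileSync(path, 'utf8'));
}
```

With highlight.js installed from npm, something like `detectFile(require('highlight.js'), 'snippet.txt')` should return the guessed language along with its relevance score. A low relevance is a hint that the guess should not be trusted much.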

I finally present to you detectLang.js (http://code.nishantarora.in/langdetect.js)

Using this is as simple as it gets (note: you will need Node.js installed). Once you get the code, you just need to run the script on a test file:

$ node langDetect.js <test file>

I have added many samples in the tests folder; feel free to try them all out. Or you can run all the samples at once:

$ ./test.sh

Feel free to fork it and use it as you wish.
Happy Hacking!

EDIT 1:

Note: Sample code for different languages can be found in the tests folder.

Note: The entire folder can be processed with the provided test.sh bash script. You will get an output similar to this.

EDIT 2:

Disclaimer: This script provides only a crude mechanism for profiling source code snippets, but it is the fastest and closest approach to language detection I have. Please read this conversation with Ivan before proceeding: https://github.com/isagalaev/highlight.js/issues/334