0

Is it possible to write a regular expression which checks if a string (some code) is minified?

Many PHP/JS obfuscators remove whitespace characters (among other things), so the final minified code sometimes looks like this:

PHP:
$a=array();if(is_array($a)){echo'ok';}

JS:
a=[];if(typeof(a)=='object'&&(a instanceof Array)){alert('ok')}

In both cases there are no space characters before or after "{", "}", ";", etc. There are also some other patterns that can help. I am not expecting a high-accuracy regex; I just need one that checks whether at least 100 characters of a string look like minified code. Thanks in advance.

PURPOSE: web malware scanner

  • Is a regex solution a requirement? Or would procedural code be sufficient? (E.g. checking whether a piece of code has fewer than 5% whitespace would be a decent check. This isn't a check a regex can do, though, at least not without a for loop repeating the regex.) Commented Aug 21, 2011 at 19:15
  • Yes, I need a regex solution. But if you can provide a PHP function, that would be helpful. Thanks. Commented Aug 21, 2011 at 19:18
  • You may be able to come up with something that checks for code which appears to be minified. But there are many ways to minify scripts and each produces different output, so a formal concept of a script being minified or not is tough to nail down. Commented Aug 21, 2011 at 19:23
  • You could just minify it and compare the length with the original version; if it's approximately the same, then it's minified. Commented Aug 21, 2011 at 19:25
  • As for JavaScript, you can use Closure Compiler's API to compress it and check how many characters have been saved. If few are saved, it probably is already minified. Commented Aug 21, 2011 at 19:28

5 Answers

2

I think a minifier will strip all newline characters, although there might possibly be one at the end of the file still if the minified code was pasted back in a text editor. Something like this will probably be fairly accurate:

/^[^\n\r]+(\r\n?|\n)?$/

That just tests that there are no newline characters in the whole thing except for possibly one at the end. So no guarantees, but I think it will work well on any longish block of code.
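To try the pattern quickly, here is a small JavaScript sketch. The helper name isSingleLine is made up for illustration; only the regex itself comes from the answer above.

```javascript
// Hypothetical helper: true when the text contains no line breaks,
// except for at most one trailing newline (\n, \r, or \r\n).
const SINGLE_LINE = /^[^\n\r]+(\r\n?|\n)?$/;

function isSingleLine(code) {
  return SINGLE_LINE.test(code);
}

// One-line code with a trailing newline vs. the same code spread over lines.
console.log(isSingleLine("a=[];if(a){alert('ok')}\n"));
console.log(isSingleLine("a = [];\nif (a) {\n  alert('ok');\n}\n"));
```

Note that an empty string does not match, because the pattern requires at least one non-newline character.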

1 Comment

There are many situations where newlines are mandatory and thus will never be stripped. Also, many minified files still contain header info: most minifiers keep header comments of the form /*! ... */ intact.
2

The short answer is "no", regex cannot do this.

Your best bet will probably be to do a statistical analysis of the source files, and compare against some known heuristics. For instance, by comparing the variable names against those often found in minimized code. A minimized file probably has a lot of one-character variable names, for instance... and won't have two-character variable names until all the one-character variable names are exhausted... etc.
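One crude way to measure the "lots of one-character names" idea for JavaScript input might look like the sketch below. Everything in it is an assumption for illustration: the function name shortNameRatio, the deliberately incomplete keyword list, and the decision to ignore string literals and comments (a real tool would tokenize properly). It also treats a PHP variable such as $a as two characters, so it is only meaningful for JS.

```javascript
// Words that should not be counted as identifiers.
// This list is intentionally short and incomplete.
const KEYWORDS = new Set([
  "if", "else", "for", "while", "do", "var", "let", "const", "function",
  "return", "new", "typeof", "instanceof", "in", "of", "this", "null",
  "true", "false",
]);

// Fraction of identifier-like tokens that are exactly one character long.
// Returns 0 when no identifiers are found.
function shortNameRatio(code) {
  const tokens = code.match(/[A-Za-z_$][A-Za-z0-9_$]*/g) || [];
  const names = tokens.filter((t) => !KEYWORDS.has(t));
  if (names.length === 0) return 0;
  const short = names.filter((t) => t.length === 1).length;
  return short / names.length;
}

console.log(shortNameRatio("function f(a,b){return a+b}"));
console.log(shortNameRatio("function add(left, right) { return left + right; }"));
```

A ratio close to 1 suggests renamed identifiers; where to put the cutoff is something you would have to tune against known files.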

Another option would be simply to run the source file through a minimizer, and see if the output is sufficiently different from the input. If not, it was probably already minimized.
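A sketch of that comparison, with the minifier passed in as a plain function from string to string. The names minifySavings, crudeMinify and probablyAlreadyMinified, and the 10% cutoff, are all assumptions for illustration. crudeMinify is only a stand-in so the example runs on its own: it will mangle whitespace inside string literals and is not a real minifier, so plug in an actual one for real use.

```javascript
// Relative size reduction achieved by `minify` (a function: string -> string).
function minifySavings(code, minify) {
  if (code.length === 0) return 0;
  return 1 - minify(code).length / code.length;
}

// Stand-in "minifier" for demonstration only: collapses whitespace runs and
// removes whitespace next to common punctuation. It does not understand
// string literals, comments or regex literals.
function crudeMinify(code) {
  return code
    .replace(/\s+/g, " ")
    .replace(/\s*([{}();=,+\-*\/<>&|!])\s*/g, "$1")
    .trim();
}

// Assumed rule of thumb: if minifying saves 10% or less, treat the input
// as already minified. The cutoff is arbitrary and needs tuning.
function probablyAlreadyMinified(code, minify = crudeMinify, maxSavings = 0.1) {
  return minifySavings(code, minify) <= maxSavings;
}

console.log(probablyAlreadyMinified("a=[];if(a instanceof Array){alert('ok')}"));
console.log(probablyAlreadyMinified("a = [];\nif (a instanceof Array) {\n    alert('ok');\n}\n"));
```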

But I have to agree with sg3s's final sentence: If you can explain why you need this, we can probably provide more useful answers to your actual needs.

4 Comments

@Ken - How on Earth do you expect a minified-code detector to help you with that? Both legit code and malware are frequently minified.
In that case we're definitely looking at a parser that somehow checks how optimized the code is. There really isn't a simple way to do this. You'll need the logic to know what is and isn't optimal in a script. Needless to say, you'd be building something on par with JSLint for checking code optimizations.
@Justin: detector/scanner will point me to files which I should check for potential malware code. Like I said before I am not expecting a high accuracy. Just need something paranoid. Any false-positive thing is ok.
@Ken: If you're happy with false positives, then just use a random number generator. It will probably be exactly as accurate as a minimizer detector for malware detection, and much easier to write.
0

No. Minifying doesn't change the syntax or the intent of the code, and some people who are very familiar with PHP and/or JS write simple functions on one line without any whitespace at all (me :s).

What you could do is count all the whitespace characters in a string, though this would also be unreliable, since some constructs simply need whitespace, like x instanceof y. Also, not all minified code is crammed into a single line (see jQuery UI), so you can't really count on that either.
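For what it's worth, the whitespace count could be sketched like this. The 5% threshold is the figure suggested in a comment on the question and the 100-character minimum comes from the question itself; the function names are made up, and the caveats above still apply.

```javascript
// Share of characters in `code` that are whitespace (0 for empty input).
function whitespaceRatio(code) {
  if (code.length === 0) return 0;
  const ws = code.match(/\s/g) || [];
  return ws.length / code.length;
}

// Flags text of at least `minLength` characters whose whitespace share is
// below `threshold`. Both defaults are rough guesses, not tuned values.
function looksMinified(code, threshold = 0.05, minLength = 100) {
  return code.length >= minLength && whitespaceRatio(code) < threshold;
}

const tight = "a=[];if(typeof(a)=='object'&&(a instanceof Array)){alert('ok')}".repeat(2);
const loose = "a = [];\nif (a instanceof Array) {\n    alert('ok');\n}\n".repeat(3);
console.log(looksMinified(tight), looksMinified(loose));
```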

Maybe you can explain why you need to know this and we can try and find an alternative?

0

You can't tell whether it was minified or just written like that by hand (probably only applies to smaller scripts). But you can check that it doesn't contain unnecessary whitespace.

Take a look at an open-source obfuscator/minifier and see what rules it uses to remove whitespace. Validating that those rules were applied should work; if the regex gets too complex, a simple parser might be needed.

Just make sure that string literals like a="if ( b )" are excluded.
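A rough sketch of that idea: blank out quoted strings first, then count whitespace that sits directly next to punctuation a minifier would normally squeeze against. The function names and the punctuation set are assumptions, and the string handling is simplistic (no comments, regex literals or template strings), which is where a simple parser would do better.

```javascript
// Replace quoted string literals with an empty pair of quotes so their
// contents are ignored. Handles backslash escapes only.
function stripStrings(code) {
  return code.replace(/"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*'/g, '""');
}

// Count runs of whitespace that touch one of { } ( ) ; = ,
// outside of string literals.
function paddedPunctuationCount(code) {
  const stripped = stripStrings(code);
  const padded = stripped.match(/\s+[{}();=,]|[{}();=,]\s+/g) || [];
  return padded.length;
}

// The spaces inside the string literal are not counted;
// the ones in the hand-formatted statement are.
console.log(paddedPunctuationCount('a="if ( b )";'));
console.log(paddedPunctuationCount("if ( b ) { c = 1; }"));
```

A count of zero over a long stretch of code would be a hint that whitespace was stripped deliberately.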

0

Run it through a parser for that particular language (even a prettifier might work fine) and modify it to count the number of unused characters. Use the percentage of unused chars vs. number of chars in documents as a test for minification. I don't think you can do this accurately with regex, although counting whitespace vs. document content might be okay.
