Question

How to minify HTML using C++?

Resources

An external library could be the answer, but I'm mostly looking for improvements to my current code. That said, I'm all ears for other possibilities.

Current code

This is my interpretation in C++ of the following answer.

The only parts I had to change from the original post are the "(?ix)" at the top and a few escape characters.

#include <string>
#include <boost/regex.hpp>
void minifyhtml(std::string* s) {
  boost::regex nowhitespace(
    "(?ix)"
    "(?>"           // Match all whitespans other than single space.
    "[^\\S ]\\s*"   // Either one [\t\r\n\f\v] and zero or more ws,
    "| \\s{2,}"     // or two or more consecutive-any-whitespace.
    ")"             // Note: The remaining regex consumes no text at all...
    "(?="           // Ensure we are not in a blacklist tag.
    "[^<]*+"        // Either zero or more non-"<" {normal*}
    "(?:"           // Begin {(special normal*)*} construct
    "<"             // or a < starting a non-blacklist tag.
    "(?!/?(?:textarea|pre|script)\\b)"
    "[^<]*+"        // more non-"<" {normal*}
    ")*+"           // Finish "unrolling-the-loop"
    "(?:"           // Begin alternation group.
    "<"             // Either a blacklist start tag.
    "(?>textarea|pre|script)\\b"
    "| \\z"         // or end of file.
    ")"             // End alternation group.
    ")"             // If we made it here, we are not in a blacklist tag.
  );
  
  // @todo Don't remove conditional html comments
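  // Non-greedy, so each comment is matched on its own; a greedy ".*" would
  // swallow everything between the first "<!--" and the last "-->".
  boost::regex nocomments("<!--.*?-->");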
  
  *s = boost::regex_replace(*s, nowhitespace, " ");
  *s = boost::regex_replace(*s, nocomments, "");
}

Only the first regex is from the original post; the second one is something I'm working on and should be considered far from complete. It should hopefully give a good idea of what I'm trying to accomplish, though.
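
For the @todo about conditional comments, one possible direction is a negative lookahead right after the opening "<!--". The following is only a minimal sketch under a few assumptions: it uses std::regex instead of Boost so that it compiles on its own, the function name strip_comments is hypothetical, and it only recognises conditional comments written as "<!--[if ...", "<!-->" and "<!--<![endif]-->".

```cpp
#include <regex>
#include <string>

// Hypothetical helper, not part of the code above: removes ordinary HTML
// comments but leaves conditional comments in place.
std::string strip_comments(const std::string& html) {
    // (?!\[if|<!\[|>) rejects "<!--[if ...", "<!--<![endif]-->" and "<!-->".
    // [\s\S]*? matches any characters, newlines included, up to the first "-->".
    static const std::regex comment(R"(<!--(?!\[if|<!\[|>)[\s\S]*?-->)");
    return std::regex_replace(html, comment, "");
}
```

Like the comment regex above, this still strips comment-like text inside <script> or <pre>, so it is not a complete answer to the problem either.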

  • There is no such thing as minifying HTML. Every single whitespace character is potentially meaningful, such as within a <textarea> or <pre> or if the container has white-space:pre-wrap. Add in the fact that JavaScript can change this on the fly, and you have absolutely no way of knowing what should be kept and what can be safely removed. At least, not automatically. Manually, sure, you can minify your HTML. Commented Apr 21, 2013 at 18:07
  • @Kolink I knew someone would tell me this :D I'm writing the code though, so I have full awareness of the restrictions it applies. Commented Apr 21, 2013 at 18:17
  • Removing the space in “> <” isn’t only an error in textarea etc., it also affects the layout in other code (essentially whenever inline tags are involved). If you really want to minify HTML, use a proper HTML parser, parse the input properly and write it back out. Commented Apr 21, 2013 at 18:41
  • @KonradRudolph Good point on the inline elements, will remove that part then :) Commented Apr 21, 2013 at 18:44
  • Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. (Attributed to jwz.) Commented Jun 12, 2013 at 5:34

1 Answer

Regexps are a powerful tool, but I think that using them in this case is a bad idea. For example, the regexp you provided is a maintenance nightmare: by looking at it, you can't quickly understand what the heck it is supposed to match.

You need an HTML parser that tokenizes the input file and lets you access the tokens either as a stream or as an object tree. Basically: read the tokens, discard the tokens and attributes you don't need, then write what remains to the output. Using something like this would let you develop a solution faster than if you tried to tackle it using regexps.
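
To make the token idea concrete, here is a rough sketch with a hand-rolled scanner rather than a real parser. Everything in it is an assumption for illustration: the names minify_html_tokens, open_tag_name and append_collapsed are made up, a tag is taken to end at the next '>' (so a '>' inside a quoted attribute value will confuse it), and only text, comments and tags are recognised as tokens.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Lower-cased element name of an opening tag such as "<PRE class=x>".
// Returns "" for closing tags, "<!DOCTYPE ...>" and anything else.
static std::string open_tag_name(const std::string& tag) {
    std::string name;
    for (std::size_t i = 1; i < tag.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(tag[i]);
        if (!std::isalnum(c)) break;
        name += static_cast<char>(std::tolower(c));
    }
    return name;
}

// Append text to out, collapsing each whitespace run to a single space
// (also across a dropped comment, since it looks at the end of out).
static void append_collapsed(std::string& out, const std::string& text) {
    for (char ch : text) {
        if (std::isspace(static_cast<unsigned char>(ch))) {
            if (out.empty() || out.back() != ' ') out += ' ';
        } else {
            out += ch;
        }
    }
}

std::string minify_html_tokens(const std::string& in) {
    // Lower-cased copy, used only for case-insensitive closing-tag searches.
    std::string lower(in);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    std::string out;
    std::size_t pos = 0;
    while (pos < in.size()) {
        if (in[pos] != '<') {
            // Text token: everything up to the next '<'.
            std::size_t end = in.find('<', pos);
            if (end == std::string::npos) end = in.size();
            append_collapsed(out, in.substr(pos, end - pos));
            pos = end;
        } else if (in.compare(pos, 4, "<!--") == 0) {
            // Comment token: dropped unless it looks like a conditional comment.
            std::size_t end = in.find("-->", pos + 4);
            end = (end == std::string::npos) ? in.size() : end + 3;
            if (in.compare(pos, 7, "<!--[if") == 0 ||
                in.compare(pos, 7, "<!--<![") == 0) {
                out.append(in, pos, end - pos);
            }
            pos = end;
        } else {
            // Tag token: copied through unchanged up to the next '>'.
            std::size_t end = in.find('>', pos);
            end = (end == std::string::npos) ? in.size() : end + 1;
            const std::string tag = in.substr(pos, end - pos);
            out += tag;
            pos = end;
            const std::string name = open_tag_name(tag);
            if (name == "pre" || name == "textarea" || name == "script") {
                // Copy the element's content verbatim up to its closing tag.
                std::size_t close = lower.find("</" + name, pos);
                if (close == std::string::npos) close = in.size();
                out.append(in, pos, close - pos);
                pos = close;
            }
        }
    }
    return out;
}
```

Each branch of the loop handles one kind of token, so a new rule (say, dropping an attribute) would go in exactly one place. Whitespace between tags is reduced to one space rather than removed, because removing it can change the layout of inline elements. A real HTML parser would replace the scanning part and deal with the edge cases this sketch ignores.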

I think you might be able to use an XML parser, or you could search for an XML parser with HTML support.

In C++, candidates include libxml2 (which ships an HTML parser module), Qt 4, and tinyxml; libstrophe also uses some kind of XML parser that could work.

Please note that C++ (especially C++03) might not be the best language for this kind of program. Although I strongly dislike Python, its "Beautiful Soup" module would work very well for this kind of problem.

Qt 4 might work because it provides a decent Unicode string type (and you'll need one if you're going to parse HTML).
