8

Are there any compression algorithms -- lossy or lossless -- that have been specifically adapted to deal with real-world (messy and invalid) HTML content?

If not, what characteristics of HTML could we take advantage of to create such an algorithm? What are the potential performance gains?

Also, I'm not asking the question to serve such content (via Apache or any other server), though that's certainly interesting, but to store and analyze it.

Update: I don't mean GZIP -- that's obvious -- but rather an algorithm specifically designed to take advantage of characteristics of HTML content. For example, the predictable tag and tree structure.

  • lossy? <script> became <scr1pt> fine? Commented Mar 10, 2010 at 17:25
  • No, lossy in the sense that: "<p>whatever <em>word</em>whatever whatever" might become: "<p>whatever <em>word</em> whatever whatever</p>" ...in the same way that tidy maintains structure and yet cleans up page code. Commented Mar 10, 2010 at 17:30
  • From what I understand, most compression algorithms are based on statistical repetition of content. The opening and closing tags fall within those boundaries, and any decent readily available compression algorithm should suffice, because after all HTML is all ASCII. I am not sure what type of analysis you want to run on the stored data, but an important aspect would be the decompression cost incurred on such compressed content before you can run analysis on it. Commented Mar 10, 2010 at 17:54

11 Answers

5

Brotli is a compression algorithm specialized for HTML/English content.

Source: https://en.wikipedia.org/wiki/Brotli

Unlike most general purpose compression algorithms, Brotli uses a pre-defined 120 kilobyte dictionary. The dictionary contains over 13000 common words, phrases and other substrings derived from a large corpus of text and HTML documents.[6][7] A pre-defined dictionary can give a compression density boost for short data files.
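To get a rough sense of that dictionary's effect on short documents, here is a minimal sketch, assuming the third-party brotli Python package (pip install brotli) and a made-up HTML snippet, that compares gzip and Brotli output sizes:

```python
import gzip
import brotli  # third-party binding for Google's Brotli library

# Made-up sample page; short documents are where the built-in dictionary helps most.
html = (b'<!DOCTYPE html><html><head><meta charset="utf-8">'
        b'<title>Example</title></head>'
        b'<body><div class="content"><p>Hello, <em>world</em>!</p></div>'
        b'</body></html>')

gz = gzip.compress(html)
br = brotli.compress(html, quality=11)  # maximum quality

print(f"original: {len(html)} bytes")
print(f"gzip:     {len(gz)} bytes")
print(f"brotli:   {len(br)} bytes")
```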



4

I do not know of an "off-the-shelf" compression library explicitly optimized for HTML content.

Yet, HTML text should compress quite nicely with generic algorithms (do read the bottom of this answer for better algorithms). Typically all variations on Lempel–Ziv perform well on HTML-like languages, owing to the highly repetitive nature of specific language idioms; GZip, often cited, uses such an LZ-based algorithm (LZ77, I think).

An idea to maybe improve upon these generic algorithms would be to prime an LZ-type circular buffer with the most common HTML tags and patterns at large. In this fashion, we'd reduce the compressed size by using back-references to these patterns from their very first occurrence onward. This gain would be particularly noticeable on smaller HTML documents.
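As a concrete illustration of the priming idea, DEFLATE already supports a preset dictionary, exposed in Python through zlib's zdict parameter. The sketch below uses a made-up, untuned dictionary of common tags, just to show the mechanism; a real dictionary would be derived from a corpus:

```python
import zlib

# Illustrative guess at common HTML substrings -- not a tuned dictionary.
PRESET = (b'<!DOCTYPE html><html><head><meta charset="utf-8"><title></title></head>'
          b'<body><div class="><span></span></div></body></html><a href="http://">')

html = (b'<!DOCTYPE html><html><head><title>Hi</title></head>'
        b'<body><div class="main"><p>Hello <a href="http://example.com">there</a>'
        b'</p></div></body></html>')

plain = zlib.compress(html, 9)

co = zlib.compressobj(level=9, zdict=PRESET)
primed = co.compress(html) + co.flush()

# The decompressor must be primed with the same dictionary.
do = zlib.decompressobj(zdict=PRESET)
assert do.decompress(primed) == html

print(f"no dictionary:   {len(plain)} bytes")
print(f"with dictionary: {len(primed)} bytes")
```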

A complementary, similar idea is to have the compression and decompression methods imply (i.e. not transmit) the side information of an LZ-x algorithm's secondary stage (say, the Huffman tree in the case of LZH), built from statistics typical of HTML, taking care to exclude from the character counts the [statistically weighted] instances of characters that end up encoded as back-references. Such a filtered character distribution would probably be closer to that of plain English (or the targeted web sites' national language) than the complete HTML text.


Unrelated to the above [educated, I hope] guesses, I started searching the web for information on this topic.

I found this 2008 scholarly paper (PDF format) by Przemysław Skibiński of the University of Wrocław. The paper's abstract indicates a 15% improvement over GZIP, with comparable compression speed.

I may otherwise be looking in the wrong places; there doesn't seem to be much interest in this. It could just be that the additional gain, relative to a plain or moderately tuned generic algorithm, wasn't deemed sufficient to warrant such interest, even in the early days of Web-enabled cell phones (when bandwidth was at quite a premium...).


2

About the only "lossy" processing I am willing to apply to HTML content, messy or not, is whitespace flattening. This is a typical post-publish step that high-volume sites perform on their content.
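A minimal sketch of that kind of whitespace flattening (illustrative only; a real pass would have to leave <pre>, <textarea> and <script> contents untouched, which this sketch ignores):

```python
import re

def flatten_whitespace(html: str) -> str:
    # Collapse insignificant whitespace -- "lossy" only in the sense that
    # the original formatting is not preserved.
    html = re.sub(r">\s+<", "><", html)       # drop whitespace between adjacent tags
    html = re.sub(r"[ \t\r\n]+", " ", html)   # collapse remaining runs to one space
    return html.strip()

messy = """<div>
    <p>   Hello,
        <em>world</em>   </p>
</div>"""
print(flatten_whitespace(messy))  # -> <div><p> Hello, <em>world</em></p></div>
```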

You can also flatten large JavaScript libraries using the YUI Compressor, which renames all JavaScript variables to short names, removes whitespace, etc. It is very important for large apps using kits like ExtJS, Dojo, etc.


2

Is gzip compression not sufficient for your needs? It gives you about 10:1 compression ratio, not only with HTML contents but also with JavaScript, CSS etc. files, and is readily available on most servers or reverse proxies (e.g. Apache's mod_deflate, Nginx's NginxHttpGzipModule etc.) and all modern browsers (you can instruct both Apache and Nginx to skip compression for specific browsers based on User-Agent.)

You'll be surprised how close gzip compression comes to optimal. Some people have suggested minifying your files; however, unless your files contain lots of comments (which the minifier can discard completely, i.e. what you probably referred to as "lossy" -- but something you probably don't want to do with HTML anyway, not unless you're sure that none of your <script> or <style> tags are inside HTML comments <!-- --> to accommodate antediluvian browsers), remember that minifying achieves most of its gains from a technique similar to (yet more limited than) DEFLATE. So expect a minified file to be larger or much larger than a gzipped original (particularly true with HTML, in which you are stuck with the W3C's tags and attributes, and only gzip can help you there), and expect gzipping a minified file to give minimal gain over gzipping the original file (again, unless the original file contained lots of comments which can be safely discarded by a minifier).
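If you want to put a number on that for your own stored pages, a small sketch like the following (with a placeholder file name) reports gzip's ratio at the default and maximum levels:

```python
import gzip
from pathlib import Path

raw = Path("page.html").read_bytes()       # placeholder path to a stored page
for level in (6, 9):                       # 6 is gzip's default level, 9 is the maximum
    packed = gzip.compress(raw, compresslevel=level)
    print(f"level {level}: {len(raw)} -> {len(packed)} bytes "
          f"({len(raw) / len(packed):.1f}:1)")
```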

3 Comments

  • gzip is sufficient, but I'm a computer scientist -- I want optimal. :)
  • I think the question is more along the lines of "I know I can compress HTML by removing extraneous whitespace, what other compression techniques can I perform and still have valid HTML?"
  • Not at the time when the question (and the gzip answers) were originally posted. Moreover, I clearly make the case that non-gzip techniques are in most cases futile. You may not have read the post thoroughly, but the "...what other techniques...", namely removing HTML comments, is also addressed.
1

Use S-expressions instead, saves you a number of characters per tag :)


0

If I understand your question correctly, what you need is gzip compression, which is available pretty easily with Apache.

1 Comment

  • +1: Gzip is optimized for text content and HTML is often just simple ASCII. There are modules for Apache that can gzip on the fly.
0

Run your code through an HTML minifier/obfuscator that removes as much markup as possible, then let your web server compress it with gzip.


0

No, there are not any HTML-specific compression algorithms, because the general-purpose ones have proved adequate.

The potential gains would come from knowing ahead of time the likely elements of an HTML page - you could start with a predefined dictionary that would not have to be part of the compressed stream. But this would not give a noticeable gain, as compression algorithms are extraordinarily good at picking out common sub-expressions on the fly.


0

You would usually use a common algorithm like gzip which is supported by most browsers through the HTTP protocol. The Apache documentation shows how to enable mod_deflate without breaking the browser support of your website.

Additionally, you can minify static HTML files (or do that dynamically).


0

You could consider each unique grouping (i.e. tags and attributes) as a symbol, determine a minimum symbol count and re-encode using Shannon entropy; this would generate one large blob of bytes with maximal compression. I will say that this may not be much better than gzip.
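A minimal sketch of that estimate, assuming a naive tag regex and made-up markup, just to put a number on the symbol-level lower bound:

```python
import math
import re
from collections import Counter

html = ('<div class="row"><p>one</p><p>two</p><p>three</p>'
        '<div class="row"><p>four</p></div></div>')

tags = re.findall(r"<[^>]+>", html)   # treat every tag (opening or closing) as one symbol
counts = Counter(tags)
total = sum(counts.values())

# Zeroth-order Shannon entropy of the tag-symbol stream, in bits per symbol.
entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
raw_bytes = sum(len(t) for t in tags)

print(f"{total} tag symbols, {len(counts)} distinct")
print(f"entropy {entropy:.2f} bits/symbol -> "
      f"~{math.ceil(total * entropy / 8)} bytes vs {raw_bytes} bytes of raw tags")
```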


0

There is now Efficient XML Interchange (EXI) Format. From the abstract:

EXI is a very compact representation for the Extensible Markup Language (XML) Information Set that is intended to simultaneously optimize performance and the utilization of computational resources. The EXI format uses a hybrid approach drawn from the information and formal language theories, plus practical techniques verified by measurements, for entropy encoding XML information. Using a relatively simple algorithm, which is amenable to fast and compact implementation, and a small set of datatype representations, it reliably produces efficient encodings of XML event streams.

The Working Group's page links to other useful documents including a brief primer as well as empirical evaluation.

"Fast Infoset" is another compact binary XML representation.

These are for valid XML documents, so they may not meet your question's requirements for handling messy & invalid HTML.

