8

Are there any compression algorithms -- lossy or lossless -- that have been specifically adapted to deal with real-world (messy and invalid) HTML content?

If not, what characteristics of HTML could we take advantage of to create such an algorithm? What are the potential performance gains?

Also, I'm not asking the question to serve such content (via Apache or any other server), though that's certainly interesting, but to store and analyze it.

Update: I don't mean GZIP -- that's obvious -- but rather an algorithm specifically designed to take advantage of characteristics of HTML content. For example, the predictable tag and tree structure.

  • lossy? <script> became <scr1pt> fine? Commented Mar 10, 2010 at 17:25
  • No, lossy in the sense that: "<p>whatever <em>word</em>whatever whatever" might become: "<p>whatever <em>word</em> whatever whatever</p>" ...in the same way that tidy maintains structure and yet cleans up page code. Commented Mar 10, 2010 at 17:30
  • From what I understand, most compression algorithms are based on statistical repetition of content. The opening and closing tags fall within those boundaries, and any decent readily available compression algorithm should suffice, because after all HTML is all ASCII. I am not sure what type of analysis you want to run on the stored data, but an important aspect would be the decompression cost incurred on such compressed content before you can run analysis on it. Commented Mar 10, 2010 at 17:54

11 Answers

5

Brotli is a compression algorithm specialized for HTML/English content.

Source: https://en.wikipedia.org/wiki/Brotli

Unlike most general purpose compression algorithms, Brotli uses a pre-defined 120 kilobyte dictionary. The dictionary contains over 13000 common words, phrases and other substrings derived from a large corpus of text and HTML documents.[6][7] A pre-defined dictionary can give a compression density boost for short data files.
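To get a rough sense of that dictionary's effect on short documents, here is a minimal sketch, assuming the third-party brotli Python package (pip install brotli) and a made-up HTML snippet, that compares gzip and Brotli output sizes:

```python
import gzip
import brotli  # third-party binding for Google's Brotli library

# Made-up sample page; short documents are where the built-in dictionary helps most.
html = (b'<!DOCTYPE html><html><head><meta charset="utf-8">'
        b'<title>Example</title></head>'
        b'<body><div class="content"><p>Hello, <em>world</em>!</p></div>'
        b'</body></html>')

gz = gzip.compress(html)
br = brotli.compress(html, quality=11)  # maximum quality

print(f"original: {len(html)} bytes")
print(f"gzip:     {len(gz)} bytes")
print(f"brotli:   {len(br)} bytes")
```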



4

I do not know of an "off-the-shelf" compression library explicitly optimized for HTML content.

Yet, HTML text should compress quite nicely with generic algorithms (do read the bottom of this answer for better algorithms). Typically all variations on Lempel–Ziv perform well on HTML-like languages, owing to the highly repetitive nature of specific language idioms; GZip, often cited, uses such an LZ-based algorithm (LZ77, I think).

An idea to maybe improve upon these generic algorithms would be to prime an LZ-type circular buffer with the most common HTML tags and patterns at large. In this fashion, we'd reduce the compressed size by using back-references to these patterns from their very first occurrence onward. This gain would be particularly noticeable on smaller HTML documents.
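As a concrete illustration of the priming idea, DEFLATE already supports a preset dictionary, exposed in Python through zlib's zdict parameter. The sketch below uses a made-up, untuned dictionary of common tags, just to show the mechanism; a real dictionary would be derived from a corpus:

```python
import zlib

# Illustrative guess at common HTML substrings -- not a tuned dictionary.
PRESET = (b'<!DOCTYPE html><html><head><meta charset="utf-8"><title></title></head>'
          b'<body><div class="><span></span></div></body></html><a href="http://">')

html = (b'<!DOCTYPE html><html><head><title>Hi</title></head>'
        b'<body><div class="main"><p>Hello <a href="http://example.com">there</a>'
        b'</p></div></body></html>')

plain = zlib.compress(html, 9)

co = zlib.compressobj(level=9, zdict=PRESET)
primed = co.compress(html) + co.flush()

# The decompressor must be primed with the same dictionary.
do = zlib.decompressobj(zdict=PRESET)
assert do.decompress(primed) == html

print(f"no dictionary:   {len(plain)} bytes")
print(f"with dictionary: {len(primed)} bytes")
```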

A complementary, similar idea is to have the compression and decompression methods imply (i.e. not transmit) the side information of an LZ-x algorithm's secondary stage (say, the Huffman tree in the case of LZH), built from statistics typical of HTML, taking care to exclude from the character counts the [statistically weighted] instances of characters that end up encoded as back-references. Such a filtered character distribution would probably be closer to that of plain English (or the targeted web sites' national language) than the complete HTML text.


Unrelated to the above [educated, I hope] guesses, I started searching the web for information on this topic.

I found this 2008 scholarly paper (PDF format) by Przemysław Skibiński of the University of Wrocław. The paper's abstract indicates a 15% improvement over GZIP, with comparable compression speed.

I may otherwise be looking in the wrong places; there doesn't seem to be much interest in this. It could just be that the additional gain, relative to a plain or moderately tuned generic algorithm, wasn't deemed sufficient to warrant such interest, even in the early days of Web-enabled cell phones (when bandwidth was at quite a premium...).


2

About the only "lossy" processing I am willing to apply to HTML content, messy or not, is whitespace flattening. This is a typical post-publish step that high-volume sites perform on their content.
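A minimal sketch of that kind of whitespace flattening (illustrative only; a real pass would have to leave <pre>, <textarea> and <script> contents untouched, which this sketch ignores):

```python
import re

def flatten_whitespace(html: str) -> str:
    # Collapse insignificant whitespace -- "lossy" only in the sense that
    # the original formatting is not preserved.
    html = re.sub(r">\s+<", "><", html)       # drop whitespace between adjacent tags
    html = re.sub(r"[ \t\r\n]+", " ", html)   # collapse remaining runs to one space
    return html.strip()

messy = """<div>
    <p>   Hello,
        <em>world</em>   </p>
</div>"""
print(flatten_whitespace(messy))  # -> <div><p> Hello, <em>world</em></p></div>
```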

You can also flatten large JavaScript libraries using the YUI Compressor, which renames all JavaScript variables to short names, removes whitespace, etc. It is very important for large apps using kits like ExtJS, Dojo, etc.


2

Is gzip compression not sufficient for your needs? It gives you about 10:1 compression ratio, not only with HTML contents but also with JavaScript, CSS etc. files, and is readily available on most servers or reverse proxies (e.g. Apache's mod_deflate, Nginx's NginxHttpGzipModule etc.) and all modern browsers (you can instruct both Apache and Nginx to skip compression for specific browsers based on User-Agent.)

You'll be surprised how close gzip compression comes to optimal. Some people have suggested minifying your files; however, unless your files contain lots of comments (which the minifier can discard completely, i.e. what you probably referred to as "lossy" -- but something you probably don't want to do with HTML anyway, not unless you're sure that none of your <script> or <style> tags are inside HTML comments <!-- --> to accommodate antediluvian browsers), remember that minifying achieves most of its gains from a technique similar to (yet more limited than) DEFLATE. So expect a minified file to be larger or much larger than a gzipped original (particularly true with HTML, in which you are stuck with the W3C's tags and attributes, and only gzip can help you there), and expect gzipping a minified file to give minimal gain over gzipping the original file (again, unless the original file contained lots of comments which can be safely discarded by a minifier).
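If you want to put a number on that for your own stored pages, a small sketch like the following (with a placeholder file name) reports gzip's ratio at the default and maximum levels:

```python
import gzip
from pathlib import Path

raw = Path("page.html").read_bytes()       # placeholder path to a stored page
for level in (6, 9):                       # 6 is gzip's default level, 9 is the maximum
    packed = gzip.compress(raw, compresslevel=level)
    print(f"level {level}: {len(raw)} -> {len(packed)} bytes "
          f"({len(raw) / len(packed):.1f}:1)")
```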

3 Comments

  • gzip is sufficient, but I'm a computer scientist -- I want optimal. :)
  • I think the question is more along the lines of "I know I can compress HTML by removing extraneous whitespace, what other compression techniques can I perform and still have valid HTML?"
  • Not at the time when the question (and the gzip answers) were originally posted. Moreover, I clearly make the case that non-gzip techniques are in most cases futile. You may not have read the post thoroughly, but the "...what other techniques...", namely removing HTML comments, is also addressed.
1

Use S-expressions instead, saves you a number of characters per tag :)


0

If I understand your question correctly, what you need is gzip compression, which is available pretty easily with Apache.

1 Comment

  • +1: Gzip is optimized for text content and HTML is often just simple ASCII. There are modules for Apache that can gzip on the fly.
0

Run your code through an HTML minifier/obfuscator that removes as much markup as possible, then let your web server compress it with gzip.


0

No, there are not any HTML-specific compression algorithms, because the general-purpose ones have proved adequate.

The potential gains would come from knowing ahead of time the likely elements of an HTML page - you could start with a predefined dictionary that would not have to be part of the compressed stream. But this would not give a noticeable gain, as compression algorithms are extraordinarily good at picking out common sub-expressions on the fly.


0

You would usually use a common algorithm like gzip which is supported by most browsers through the HTTP protocol. The Apache documentation shows how to enable mod_deflate without breaking the browser support of your website.

Additionally, you can minify static HTML files (or do that dynamically).


0

You could consider each unique grouping (i.e. tags and attributes) as a symbol, determine a minimum symbol count and re-encode using Shannon entropy; this would generate one large blob of bytes with maximal compression. I will say that this may not be much better than gzip.
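A minimal sketch of that estimate, assuming a naive tag regex and made-up markup, just to put a number on the symbol-level lower bound:

```python
import math
import re
from collections import Counter

html = ('<div class="row"><p>one</p><p>two</p><p>three</p>'
        '<div class="row"><p>four</p></div></div>')

tags = re.findall(r"<[^>]+>", html)   # treat every tag (opening or closing) as one symbol
counts = Counter(tags)
total = sum(counts.values())

# Zeroth-order Shannon entropy of the tag-symbol stream, in bits per symbol.
entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
raw_bytes = sum(len(t) for t in tags)

print(f"{total} tag symbols, {len(counts)} distinct")
print(f"entropy {entropy:.2f} bits/symbol -> "
      f"~{math.ceil(total * entropy / 8)} bytes vs {raw_bytes} bytes of raw tags")
```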


0

There is now Efficient XML Interchange (EXI) Format. From the abstract:

EXI is a very compact representation for the Extensible Markup Language (XML) Information Set that is intended to simultaneously optimize performance and the utilization of computational resources. The EXI format uses a hybrid approach drawn from the information and formal language theories, plus practical techniques verified by measurements, for entropy encoding XML information. Using a relatively simple algorithm, which is amenable to fast and compact implementation, and a small set of datatype representations, it reliably produces efficient encodings of XML event streams.

The Working Group's page links to other useful documents including a brief primer as well as empirical evaluation.

"Fast Infoset" is another compact binary XML representation.

These are for valid XML documents, so they may not meet your question's requirements for handling messy & invalid HTML.

