
Assuming we have 2 files:

FILE 1

  • Uncompressed size: 793765 bytes
  • Compressed size: 604911 bytes
  • Data type: 99% Base64
  • Compression method: deflate dynamic

FILE 2

  • Uncompressed size: 793765 bytes
  • Compressed size: 604911 bytes
  • Data type: 99% Base64
  • Compression method: deflate dynamic

The two files share the same 20 initial bytes of uncompressed data. What are the chances that the compressed data after the ZIP headers shares the same 3 initial bytes?

I have come across 5 files with the above characteristics, and 4 of them had:

14 9A C7

and the other:

1C 9A C7

I am confused by this.

If this is the case, then given the characteristics above, one can easily guess up to the first 5 bytes of the compressed data.
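
For concreteness, here is a minimal Python sketch (not part of the question as asked; the file name is a placeholder) of how the first compressed bytes after the ZIP local file header can be read out:

  import struct

  def first_compressed_bytes(path, n=5):
      # The local file header has a fixed 30-byte part, followed by the
      # file name and an extra field, then the compressed data.
      with open(path, "rb") as f:
          header = f.read(30)
          sig, = struct.unpack("<I", header[:4])
          assert sig == 0x04034B50, "not a ZIP local file header"
          name_len, extra_len = struct.unpack("<HH", header[26:30])
          f.seek(name_len + extra_len, 1)   # skip variable-length fields
          return f.read(n)

  print(first_compressed_bytes("file1.zip").hex(" "))  # "file1.zip" is hypothetical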

  • Are you talking about password-protected compression? If not, then I don't see what your concern is (the purpose of compression is to rewrite content so that it needs less space, but the original file can still be recomputed). Commented Oct 6 at 12:01
  • "What are the chances that the compressed data after the zip headers share 3 initial bytes" — wouldn't that depend on what the rest of the data in the 32 KB block is? A better question is why you care: is it relevant that two files with the same prefix have similar deflate prefixes? Deflate isn't supposed to be cryptographically secure in any way, so why would you expect this? Commented Oct 6 at 12:09
  • @Charlieface I am just curious about the inner workings of deflate compression, especially on Base64 input data. Also, I was given a ZIP file with an encrypted .eml file inside; I guessed the first 3 bytes the same way as above and cracked it 4 days ago at a cost of 2**59 operations on a GTX 1070. Before I publish the code, I need this to be clear. Commented Oct 6 at 13:32
  • @Rafalon I know that. Commented Oct 6 at 13:33
  • You can look at the zlib implementation code, or look at the spec. It doesn't offer encryption. ZIP is an unrelated format; furthermore, it supports various encryption algorithms, and I would hazard that the one you broke is an insecure one. You are barking up the wrong forest, never mind the wrong tree. Commented Oct 6 at 13:37

1 Answer


It appears that all your compression is managing to do is mostly undo the expansion of the Base64 encoding. Instead of compressing the Base64, you should simply decode it back to the original binary. Then you could try to compress that, though it appears the original binary may not be very compressible.
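
As a sketch of that suggestion (Python; "file1.b64" is a hypothetical input file, not one of yours):

  import base64, zlib

  # Decode the Base64 first, then see whether the underlying
  # binary compresses at all.
  b64_text = open("file1.b64", "rb").read()
  binary = base64.b64decode(b64_text)      # non-alphabet bytes are discarded
  recompressed = zlib.compress(binary, 9)
  print(len(b64_text), len(binary), len(recompressed))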

The header of a dynamic block depends on the entire uncompressed content of that block, not just the first 20 bytes or whatever of that content. You can think of the header as a summary of the statistics of that block's uncompressed data and string matches. Furthermore, those initial three bytes are a statistical summary of a statistical summary, as they represent the coding of the header, which is itself compressed. Those first three bytes are quite far removed from the information contained in the first 20 bytes of the uncompressed data.

There are constraints on those first three bytes that are independent of the uncompressed content. You can look at RFC 1951 for what the first 17 bits of a dynamic block represent. If you can depend on the fact that your input data is a Base64 encoding, then there will be a probability distribution over a limited set of possible values of those three bytes, since the range of content bytes is limited, and especially since, other than the encoding, your data appears to be incompressible and so somewhat evenly distributed over those values.
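
For illustration, here is a small Python sketch (mine, not from the RFC) that decodes those first 17 bits, keeping in mind that deflate packs bits least-significant-bit first within each byte:

  def parse_dynamic_header(data):
      # Read n bits at a time, LSB-first, per RFC 1951.
      bits = nbits = pos = 0
      def take(n):
          nonlocal bits, nbits, pos
          while nbits < n:
              bits |= data[pos] << nbits
              pos += 1
              nbits += 8
          val = bits & ((1 << n) - 1)
          bits >>= n
          nbits -= n
          return val
      bfinal = take(1)
      btype = take(2)        # 2 = dynamic Huffman codes
      hlit = take(5) + 257   # number of literal/length codes
      hdist = take(5) + 1    # number of distance codes
      hclen = take(4) + 4    # number of code length codes
      return bfinal, btype, hlit, hdist, hclen

  # The observed prefix 14 9a c7 decodes to (0, 2, 259, 27, 16):
  print(parse_dynamic_header(bytes.fromhex("149ac7")))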

For kicks, I ran one million trials of Base64 encoding of random binary data to see the distribution of the first three bytes. I saw about 200 different such sequences, and I'm sure I would have seen more had I run more trials. Here are the most common sequences, preceded by their counts out of a million, down to a 0.1% probability. Your two sequences are in here, but are far from the most common.

177522 14 9b c5
174308 14 9a c5
80347 14 9a 45
69894 14 9b 45
65311 14 9a b5
60991 14 9b b5
29937 14 99 c5
28529 1c 9b c5
25319 14 9a 35
25185 1c 9a c5
22425 14 9b 35
19393 14 9b c7
17294 14 99 45
16236 14 9b 47
13488 14 9a c7
13138 14 99 b5
12809 14 9b b7
11916 14 9a 47
9645 14 98 c5
8543 14 9a b7
8425 1c 9a b5
8321 1c 9b b5
7692 1c 9a 45
7650 1c 9b 45
6841 14 98 45
6663 14 99 35
5617 1c 9b c7
5176 14 9b 37
5013 14 98 b5
4336 14 9a 37
3670 1c 99 c5
3561 1c 9a c7
3169 14 98 35
3021 1c 9b b7
2524 14 97 c5
2458 14 97 45
2368 14 9c c5
1980 1c 9a b7
1875 14 97 b5
1831 1c 9b 47
1553 14 99 47
1500 14 99 c7
1487 1c 9a 35
1443 1c 99 45
1416 1c 99 b5
1401 1c 9a 47
1373 14 97 35
1316 1c 9b 35
1012 14 99 b7
1002 1c 98 c5
...
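
An experiment like this can be roughly reproduced as follows (Python; this is not the code used above, and the data size and trial count are scaled down, so the counts will differ):

  import base64, collections, os, zlib

  counts = collections.Counter()
  for _ in range(1000):
      data = base64.b64encode(os.urandom(100_000))
      c = zlib.compressobj(level=9, wbits=-15)   # -15 = raw deflate, as in ZIP
      prefix = (c.compress(data) + c.flush())[:3]
      counts[prefix.hex(" ")] += 1

  for seq, n in counts.most_common(10):
      print(n, seq)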

3 Comments

So the fact that the compression ratio is the same and all parameters are the same on input and output doesn't mean anything, and the initial bytes matching is purely coincidental?
Pretty much, yes. As I said, you have constrained the possible statistical summaries by always presenting Base64-encoded data, and further with what behaves as random input to the base-64 encoder. So you are drawing cards from a smaller deck, but still, yes, coincidental.
I added what the deck looks like to the answer.
