What encoding/compression algorithm is this?

Question

I'm trying to reverse engineer a binary file format, but it has no magic bytes, and no specific extension. I can influence only a single aspect of the file: a short string. By trying different strings, I was able to figure out how data is stored in the file. It seems that the whole file uses some sort of simple encoding. I hope that finding the exact encoding allows me to narrow down my search for the file format. I know the file is generated by a Windows program written in C++.

Now, after much trial-and-error, I found that some sections of the file are encoded in runs. Each run starts with a byte that indicates how many bytes will follow and where the data is retrieved.

000ddddd (1 byte)
Take the following (ddddd)+1 bytes from the encoded data.
111····· ···ddddd ···bbbbb (3 bytes)
Go back (bbbbb)+1 bytes in the decoded data, and take the next (ddddd)+9 bytes from it.
ddd····· ··bbbbbb (2 bytes)
Go back (bbbbbb)+1 bytes in the decoded data, and take the next (ddd)+2 bytes from it.

Here's an example:

This is the start of the file, with the UTF-16 string abracadabra encoded in it:

First 224 bytes of the file

  .     .  .  a  .  b  .  r     .  .  c     .  .  d     .  €  .
  0C 20 03 04 61 00 62 00 72 20 05 00 63 20 03 00 64 20 03 80 0D

To decode the string:

  0C                      number of Unicode chars: 12 (11 chars + \0)
  20 03       . . .       ??
  04                      next 5
  61 00       a .
  62 00       b .
  72          r
  20 05       . a .       back 6, take 3
  00                      next 1
  63          c
  20 03       . a .       back 4, take 3
  00                      next 1
  64          d
  20 03       . a .       back 4, take 3
  80 0D       b . r . a . back 14, take 6

This results in (UTF-16):

  a  .  b  .  r  .  a  .  c  .  a  .  d  .  a  .  b  .  r  .  a  .
  61 00 62 00 72 00 61 00 63 00 61 00 64 00 61 00 62 00 72 00 61 00

However, I have no clue as to what encoding/compression algorithm this might be. It looks like some variant of LZ that doesn't use a dictionary (like LZ77), but so far I haven't been able to find any algorithm that matches this description. I'm also not sure whether the entire file is encoded like this, or only portions of it.

Do you know this encoding? Or do you have any hints for things I might look for in the file to identify the encoding?

@Hidde I can command the program to give me a big file with a particular 18 character string in it that I choose. These are the strings that I chose, and their corresponding encoded versions in the resulting file. I haven't been able to find any other strings in the binary file, but that may be due to the encoding. — Daniel A.A. Pelsmaeker
– Daniel A.A. Pelsmaeker, Commented Mar 1, 2014 at 23:28
It appears the first byte is the length of your string in hexidecimal. — Jan Nielsen
– Jan Nielsen, Commented Mar 2, 2014 at 0:09
If this encoding is for compression, it would have to win the prize for worst compression approach, ever! — Mark Adler
– Mark Adler, Commented Mar 2, 2014 at 18:29
Please provide the result for an empty string, a single character, and a single non-ASCII character (i.e. something that requires UTF-8 or UTF-16 to encode). — Mark Adler
– Mark Adler, Commented Mar 2, 2014 at 22:48

Ben Jackson · Accepted Answer · 2014-03-03 05:47:45Z

1

After your edit I think it's LZF with the following differences to your observations:

The magic header and indication of compressed vs uncompressed has been removed in your example (not too surprising if it's embedded in a file).
You took the block length as one byte, but it is two bytes and big-endian, so the prior 0x00 is part of the length, which still works.

answered Mar 3, 2014 at 5:47

Ben Jackson

94.6k9 gold badges103 silver badges154 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Daniel A.A. Pelsmaeker Over a year ago

Yes! You seem to have hit the nail on the head. I suspect it is one or more LZF-compressed files in a container.

Community · Accepted Answer · 2020-06-20 09:12:55Z

1

Could be NTFS compression, which is LZNT1. This idea is supported by the platform and the apparent 2-byte structure, along with the byte-alignment of the actual data.

The following elements are specific to this algorithm.

Chunks: Segments of data that are compressed, uncompressed, or that denote the end of the buffer.

Chunk header: The header for a compressed or uncompressed chunk of data.

Flag bytes: A bit flag whose bits, read from low order to high order, specify the formats of the data elements that follow. For example, bit 0 corresponds to the first data element, bit 1 to the second, and so on. If the bit corresponding to a data element is set, the element is a 2-byte compressed word; otherwise, it is a 1-byte literal value.

Flag group: A flag byte followed by zero or more data elements, each of which is a single literal byte or a 2-byte compressed word.

edited Jun 20, 2020 at 9:12

CommunityBot

11 silver badge

answered Mar 2, 2014 at 23:39

Ben Jackson

94.6k9 gold badges103 silver badges154 bronze badges

4 Comments

Daniel A.A. Pelsmaeker Over a year ago

You may be onto something. I'll look into it.

Daniel A.A. Pelsmaeker Over a year ago

I looked deeper into the file, and it seems that not just the string, but the whole file (or parts of it at least) are compressed using some LZ variant. However, it isn't LZNT1, as that starts with a 16-bit chunk header indicating the size of the chunk. As seen in the screenshot, the first 16-bit value is 1, and the chunk is most certainly not 1 byte long. But I'll keep on searching. +1 for the suggestion.

Ben Jackson Over a year ago

@Virtlink: There might be an application-specific header on the front as well. If you scan through, do any of the bytes seem reasonable for a chunk size?

Daniel A.A. Pelsmaeker Over a year ago

I can't find anything that indicates a chunk size, nor any header or border-like structures at repeating intervals (e.g. every 4K). However, I've been able to reverse-engineer the encoding used, but I don't recognize it and can't find anything on it. Please see my heavily updated post.

Collectives™ on Stack Overflow

What encoding/compression algorithm is this?

2 Answers 2

1 Comment

4 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

4 Comments

Your Answer

Sign up or log in

Post as a guest

Related