Optimal dictionary for coding

Question

Given alphabet $A$ and a sequence $S$, determine the optimal dictionary $D$ of max word length $m$ that can compose the sequence.

Optimality can be given as a two-part cost $(L)$ = dictionary cost $(L(D))$ + encoding cost $(L(S|D))$, e.g.

$L(D) = \log_2(|A|) * \sum_{w \in D} |w| $
$L(S|D)$ is Huffman coding, using probabilities learned from the sequence.

Typical scenarios for my case: |A|=10, length of S=180, m=21

The goal is to find structure in the sequence, being encoded in words $w\in D$. In some cases, it is given there exists a word of length $m^*$. Please point to any references, libraries (python preferred) for the algorithm.

I do not see a question, Just a hard problem - tag reference request?. — greybeard
– greybeard, Commented Oct 7 at 20:57
For defining $L(S|D)$, what probabilities do you propose to use when generating the Huffman code? Do you want to assume a uniform distribution on $D$? — D.W.
– D.W. ♦, Commented Oct 20 at 11:11
@D.W. for $L(S/D)$, the probabilities should be learnt from the sequence. for $L(D)$. i am assuming, $log_2 (|A|) $ bits for each alphabet. I made an error, in the question above, it should be $log_2 (|A|)$ instead of $log_2(|D|)$ — samrat
– samrat, Commented Oct 20 at 17:16
Don't you need about $|D| \log_2 |D| + \log_2 |A| \sum_{w \in D} |w|$ bits to encode $D$? Should $L(D)$ be defined as that larger values? — D.W.
– D.W. ♦, Commented Oct 20 at 17:45
I'm not sure I understand what "$L(S|D)$ is Huffman coding" means. Can you define it more precisely? Do you mean the number of bits needed to represent $S$ as a sequence of words from $D$ using a a Huffman code for $D$? But then you cannot use probabilities derived from the frequency of each word of $D$ found in $S$, unless you include those frequencies explicitly in the encoding. So should $L(S|D)$ be the number of bits needed to encode the Huffman code and those frequencies, and then to represent $S$ using that Huffman code? It would help to be more explicit. — D.W.
– D.W. ♦, Commented Oct 20 at 17:45

D.W. · Accepted Answer · 2025-10-20 11:17:53Z

0

I suspect finding the exact optimum will likely be NP-hard, but you might be interested in the Sequitur algorithm, the Re-Pair algorithm, and methods for byte-pair encoding. Doing a literature search for related research papers might find other techniques.

answered Oct 20 at 11:17

D.W.♦

169k23 gold badges236 silver badges519 bronze badges

Add a comment |

Stack Exchange Network

Optimal dictionary for coding

1 Answer 1

Your Answer

Hot Network Questions

Optimal dictionary for coding

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Related

Hot Network Questions