0

I'm building a Huffman compressor in Java.
I already have: The original text, the Huffman code table (Map<Character, String>), and the order of character appearance.

My current goal is to write the compressed result into a .bin file.
However, the output file is larger than the original text, because each '0' or '1' bit is being stored as a full byte instead of being packed into real bits.

Here’s my current implementation:

private static byte[] convertBitsToBytes(String bits) {
        int len = bits.length();
        int numBytes = (int) Math.ceil(len / 8.0);
        byte[] bytes = new byte[numBytes];

        for (int i = 0; i < len; i++) {
            if (bits.charAt(i) == '1') {
                bytes[i / 8] |= (byte) (1 << (7 - (i % 8)));
            }
        }
        return bytes;
    }

    public static void saveBinaryFile(File file, String originalText, Map<Character, String> huffmanTable) {
        try (FileOutputStream fos = new FileOutputStream(file)) {
            for (char c : originalText.toCharArray()) {
                String huffmanCode = huffmanTable.get(c);
                if (huffmanCode != null) {
                    fos.write((byte) c);

                    byte[] compressedBytes = convertBitsToBytes(huffmanCode);
                    fos.write(compressedBytes);
                }
            }
            System.out.println("Binary file saved successfully.");
        } catch (IOException ex) {
            System.out.println("Error saving binary file: " + ex.getMessage());
        }
    }

I tried writing both the character and its Huffman binary string directly into the binary file.
Each bit ('0' or '1') was written as a full byte, using DataOutputStream.writeByte().

I expected the resulting .bin file to contain the original character followed by its compressed bit sequence — and overall to weigh less than the original text file from which the data was taken.

However, the file ended up larger than the text file, because each '0' and '1' is still stored as one byte instead of real bits.
I’m trying to find a way to make it truly compressed by packing the bits efficiently. Result should be "h111e10l10l10o0" in the binary file imaging I had those binary codes.

2
  • 1
    You have code commented Not use yet; why haven't you tried writing to your output file from what it returns? Commented Oct 16 at 15:34
  • Hi, from what I understand now , convertBitsToBytes works for char instead for a string, so bits.length is always 1. The reason is that you read the file and from the file you've received only a one char. Commented Oct 16 at 16:40

2 Answers 2

1

Something like:

    int bin = 0;
    int n = 0;
    for (int i = 0; i < len; i++) {
        bin = (bin << 1) | (bits.charAt(i) & 1);
        n++;
        if (n == 8) {
            // write the byte bin to the output
            n = 0;
            bin = 0;
        }
    }
    if (n != 0) {
        bin <<= 8 - n;
        // write the byte bin to the output
    }   
Sign up to request clarification or add additional context in comments.

Comments

0

Calling convertBitsToBytes on each symbol individually means that the codes for adjacent symbols can never combine to properly fill the bytes. Most bytes will be partially "unused" (all 8 bits have to exist of course, but some of them wouldn't be meaningful), and at least one byte is used per symbol, which rules out compression if the symbols were originally single-byte letters themselves. Also, a normal Huffman decoder would not be able to decompress that data, since it has unexpected "gaps" that normally don't exist.

You could first append all codes into a long string, and apply convertBitsToBytes to that. Then adjacent codes can combine. This is not a great solution, just a simple trick to get you going, perhaps sufficient for a school exercise, I don't consider it a serious implementation technique since it relies on building a string in memory that's 8x the size of the compressed file. It's fundamentally inefficient.

The "proper" (IMO) solution is to buffer up bits in an integer (int or long) and write out groups of 8 as bytes when you have them (you can save up multiple bytes if you want, but be careful to keep the buffer empty enough that you can append your largest symbol to it), so every byte is fully used (except usually the last byte, containing the bits left in the buffer in the end). That way adjacent codes still combine, but without building up a potentially large string. By the way it's a bit simpler and more efficient to do this if codes are represented as pairs of integers, one integer to hold the bit pattern of the code and another integer to hold the length of the code (this limits the maximum length of a code, but that's not a serious problem and in fact there is often an even lower length limit imposed in order to simplify table-driven Huffman decoding).

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.