3

For university, I'm researching compression techniques. One experiment I'm trying is replacing certain Welsh letters that are written as digraphs with a single character.

I expected that replacing two characters with a single character would reduce the file size (however marginally), or at worst leave it unchanged. I wrote a Python script to do this, but it actually increases the file size. The original file I tested this on was ~74,400 KB, and the output file was ~74,700 KB.

Here is my Python code:

replacements = {
        'ch':'ƒ',
        'Ch':'†',
        'CH':'‡',
        'dd':'Œ',
        'Dd':'•',
        'DD':'œ',
        'ff':'¤',
        'Ff':'¦',
        'FF':'§',
        'ng':'±',
        'Ng':'µ',
        'NG':'¶',
        'll':'º',
        'Ll':'¿',
        'LL':'Æ',
        'ph':'Ç',
        'Ph':'Ð',
        'PH':'×',
        'rh':'Ø',
        'Rh':'Þ',
        'RH':'ß',
        'th':'æ',
        'Th':'ç',
        'TH':'ð',
        }
print("Input file location: ")
inLoc = input("> ")
print("Output file location: ")
outLoc = input("> ")

with open(inLoc, "r", encoding="Latin-1") as infile, open(outLoc, "w", encoding="utf-8") as outfile:
    for line in infile:
        for src, target in replacements.items():
            line = line.replace(src, target)
        outfile.write(line)

When I tested it on a small text file only a few lines long, the output looked as expected.

Input.txt:

Lle wyt ti heddiw?

Ddoe es i at gogledd Nghymru.

Output.txt:

¿e wyt ti heŒiw?

•oe es i at gogleŒ µhymru.

Can anyone explain what is happening?

2 Answers

8

You're changing the encoding of the file. Latin-1 always uses one byte per character, but UTF-8 doesn't, so some of your special characters are being encoded with multiple bytes, resulting in the increase in size.
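
To see the difference concretely, here is a small sketch that compares how many bytes a single character occupies in each encoding. The sample characters are taken from the question's text and mapping; the helper name `encoded_sizes` is made up for this illustration.

```python
def encoded_sizes(ch):
    """Return (Latin-1 size or None, UTF-8 size) in bytes for one character."""
    try:
        latin1 = len(ch.encode("latin-1"))
    except UnicodeEncodeError:
        latin1 = None  # the character does not exist in Latin-1
    return latin1, len(ch.encode("utf-8"))


# Plain ASCII, an accented Latin-1 letter, and three replacement characters
# from the question's mapping.
for ch in ["c", "é", "º", "ƒ", "†"]:
    print(repr(ch), encoded_sizes(ch))
```

Only the ASCII character stays at one byte in UTF-8; every replacement character takes at least two, so swapping a two-byte ASCII digraph for one of them saves nothing once the file is written as UTF-8.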

7 Comments

Because some of the replacements are outside the latin-1 range (e.g. 'ƒ'), you can't write back as latin-1. So if the original text had other non-ASCII latin-1 characters like é, those are getting expanded too; each is one byte in latin-1 and two bytes in UTF-8. The OP's conversion from an ASCII digraph (two bytes in latin-1 or UTF-8) to a single low-ordinal Unicode character encoded in UTF-8 mostly doesn't cost anything, since the UTF-8 encoding is usually two bytes as well (all ordinals from 0x80 through 0x7FF are two bytes in UTF-8), although a few of the chosen characters, such as '†', '‡' and '•', lie above 0x7FF and take three bytes. But the rest of the non-ASCII latin-1 text is bloating.
Only way to reduce size is to make the mapping map digraphs to somewhere in the latin-1 range, then write back as latin-1; that would save one byte on every digraph replacement, but it risks losing data since you might have had those latin-1 characters appear for other reasons, and you can't distinguish those created by the digraph transform from those that were in the original text.
@ShadowRanger So is there no way that I can replace the digraphs with a single character of the same size (Apart from as you said still using Latin-1 and possibly losing characters)?
@hjalpmig: It wouldn't lose characters, just lose differentiation. Most of the mappings you have are already within the latin-1 space, but others are outside it. If you found a mapping that used only latin-1 outputs, you could save a small amount of space by writing as latin-1. Similarly, if the text is purely Welsh, with no non-Welsh characters, you might be able to use latin8 (ISO-8859-14, a variant encoding with Celtic language characters) to encode with one byte per character, but the problem is that locale-encoded text will open by default in the system locale, and no one uses cy_GB.
@hjalpmig: Really though, the best solution is to just use portable UTF-8 (or on Windows, UTF-16) and compress the text with a dedicated compression scheme (e.g. gzip) if space is a big issue. Fiddling around the margins by shaving a byte here and there is not compression.
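
A minimal sketch of the Latin-1-only approach discussed in these comments, under a few assumptions: the mapping is a small illustrative subset, the target for 'dd' is changed to '¬' because 'Œ' from the question is outside Latin-1, and the names `LATIN1_REPLACEMENTS` and `replace_digraphs` are hypothetical. It refuses input that already contains a target character, since the transform could not be reversed in that case.

```python
# Hypothetical mapping: every target is a single Latin-1 character (<= U+00FF).
# 'dd' maps to '¬' here only because 'Œ' from the question is outside Latin-1.
LATIN1_REPLACEMENTS = {"Ll": "¿", "ll": "º", "dd": "¬"}


def replace_digraphs(text, mapping):
    """Replace digraphs, refusing input where the result would be ambiguous."""
    for target in mapping.values():
        if ord(target) > 0xFF:
            raise ValueError(f"{target!r} cannot be encoded as Latin-1")
        if target in text:
            # The target already occurs, so the transform could not be undone.
            raise ValueError(f"{target!r} already appears in the input")
    for src, target in mapping.items():
        text = text.replace(src, target)
    return text


original = "Lle wyt ti heddiw?"
converted = replace_digraphs(original, LATIN1_REPLACEMENTS)

print(len(original.encode("latin-1")))   # size of the input
print(len(converted.encode("latin-1")))  # one byte saved per digraph
print(len(converted.encode("utf-8")))    # savings cancelled out by UTF-8
```

For this sample the Latin-1 output is 16 bytes against 18 for the original, while the same output encoded as UTF-8 is back to 18 bytes.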
0

"UTF-8 uses one byte for any ASCII character, all of which have the same code values in both UTF-8 and ASCII encoding, and up to four bytes for other characters." - https://en.wikipedia.org/wiki/Unicode
