Python - Program to reduce file size is increasing file size

Question

For University, I'm doing research into compression techniques. One experiment I'm trying to perform is replacing certain Welsh language letters (which are digraphs) with a single character.

I expected that replacing two characters with a single character would reduce the file size (however marginally) or at worst keep it the same. I have written a Python script to do this, but it actually increases the file size. The original file I tested this on was ~74,400KB, and the output file was ~74,700KB.

Here is my Python code:

replacements = {
        'ch':'ƒ',
        'Ch':'†',
        'CH':'‡',
        'dd':'Œ',
        'Dd':'•',
        'DD':'œ',
        'ff':'¤',
        'Ff':'¦',
        'FF':'§',
        'ng':'±',
        'Ng':'µ',
        'NG':'¶',
        'll':'º',
        'Ll':'¿',
        'LL':'Æ',
        'ph':'Ç',
        'Ph':'Ð',
        'PH':'×',
        'rh':'Ø',
        'Rh':'Þ',
        'RH':'ß',
        'th':'æ',
        'Th':'ç',
        'TH':'ð',
        }
print("Input file location: ")
inLoc = input("> ")
print("Output file location: ")
outLoc = input("> ")

with open(inLoc, "r", encoding="Latin-1") as infile, open(outLoc, "w", encoding="utf-8") as outfile:
    for line in infile:
        for src, target in replacements.items():
            line = line.replace(src, target)
        outfile.write(line)

When I tested it on a very small text file a few lines long, I looked at the output and it was as expected.

Input.txt:

Lle wyt ti heddiw?

Ddoe es i at gogledd Nghymru.

Output.txt:

¿e wyt ti heŒiw?

•oe es i at gogleŒ µhymru.

Can anyone explain what is happening?

gct · Accepted Answer · 2016-04-18 16:10:49Z

8

You're changing the encoding of the file. latin-1 is always 1-byte per character, but utf-8 isn't, so some of your special characters are being encoded with multiple bytes, resulting in the increase in size.
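To make the size difference concrete, here is a minimal sketch that compares the encoded length of a few characters in both encodings. The helper name `encoded_len` and the sample characters are chosen for illustration only; the characters are written as escapes so the code points are unambiguous.

```python
# Characters: a, e-acute, inverted question mark, f-with-hook, dagger.
samples = ["a", "\u00e9", "\u00bf", "\u0192", "\u2020"]


def encoded_len(ch, encoding):
    """Return the number of bytes ch takes in encoding, or None if it cannot be encoded."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None


# Map each character to (latin-1 size, utf-8 size).
sizes = {ch: (encoded_len(ch, "latin-1"), encoded_len(ch, "utf-8")) for ch in samples}

for ch, (latin1_size, utf8_size) in sizes.items():
    print(f"U+{ord(ch):04X}: latin-1={latin1_size}, utf-8={utf8_size}")
```

Only the plain ASCII letter stays at one byte in UTF-8. Characters that were one byte in latin-1 become two, and characters outside latin-1 (some of the replacement targets in the question) take two or three.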


7 Comments

ShadowRanger Over a year ago

Because some of the replacements are outside the latin-1 range (e.g. 'ƒ'), you can't write back as latin-1. So if you had other latin-1 characters like é in the original text, those are getting expanded too; they're all one byte in latin-1, and two bytes in UTF-8. The OP's conversion from ASCII digraph (two bytes in latin-1 or UTF-8) to a single low-ordinal Unicode character encoded in UTF-8 doesn't actually cost anything, since the UTF-8 encoding is likely two bytes as well (all ordinals from 0x80 through 0x7ff are two bytes in UTF-8). But the rest of your non-ASCII latin-1 is bloating.

ShadowRanger Over a year ago

Only way to reduce size is to make the mapping map digraphs to somewhere in the latin-1 range, then write back as latin-1; that would save one byte on every digraph replacement, but it risks losing data since you might have had those latin-1 characters appear for other reasons, and you can't distinguish those created by the digraph transform from those that were in the original text.
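A rough sketch of that idea, under the assumption that the whole text fits in memory: pick replacement targets only from latin-1 characters that do not occur in the input, so the transform stays reversible. The names `build_mapping`, `pack` and `unpack` are hypothetical, and the digraph list is taken from the question.

```python
DIGRAPHS = ["ch", "dd", "ff", "ng", "ll", "ph", "rh", "th"]


def build_mapping(text):
    """Map each digraph variant to a latin-1 character that is unused in text."""
    used = set(text)
    # Candidate targets: printable upper half of latin-1 (0xA1..0xFF).
    free = [chr(cp) for cp in range(0xA1, 0x100) if chr(cp) not in used]
    sources = [v for d in DIGRAPHS for v in (d, d.capitalize(), d.upper())]
    if len(free) < len(sources):
        raise ValueError("not enough unused latin-1 characters for a reversible mapping")
    return dict(zip(sources, free))


def pack(text, mapping):
    """Replace every digraph with its single-character target."""
    for src, target in mapping.items():
        text = text.replace(src, target)
    return text


def unpack(text, mapping):
    """Undo pack(): expand every target back into its digraph."""
    for src, target in mapping.items():
        text = text.replace(target, src)
    return text


sample = "Lle wyt ti heddiw?\nDdoe es i at gogledd Nghymru.\n"
mapping = build_mapping(sample)
packed = pack(sample, mapping)

# Both sides are latin-1, so each replaced digraph saves exactly one byte.
print(len(sample.encode("latin-1")), len(packed.encode("latin-1")))
```

The mapping depends on the input, so it would have to be stored alongside the output to decode it later; that overhead is not accounted for here.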

hjalpmig Over a year ago

@ShadowRanger So is there no way that I can replace the digraphs with a single character of the same size (Apart from as you said still using Latin-1 and possibly losing characters)?

ShadowRanger Over a year ago

@hjalpmig: It wouldn't lose characters, just lose differentiation. Most of the mappings you have are already within the latin-1 space, but others are outside it. If you found a mapping that used only latin-1 outputs, you could save a small amount of space by writing as latin-1. Similarly, if this can be made actually Welsh, with no non-Welsh characters, you might be able to use latin8 (a variant encoding with Celtic language characters) to encode with one byte per character, but the problem is that locale encoded stuff will open by default in the system locale, and no one uses cy_GB.
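For illustration, Python ships a codec for that encoding under the name `iso8859_14`. The sketch below checks how many bytes a Welsh accented letter takes in each encoding; the helper `byte_len` is a made-up name.

```python
# U+0175 is w-circumflex, a letter used in Welsh that latin-1 does not contain.
w_circumflex = "\u0175"


def byte_len(text, encoding):
    """Return the encoded size in bytes, or None if the text cannot be encoded."""
    try:
        return len(text.encode(encoding))
    except UnicodeEncodeError:
        return None


latin8_len = byte_len(w_circumflex, "iso8859_14")
utf8_len = byte_len(w_circumflex, "utf-8")
latin1_len = byte_len(w_circumflex, "latin-1")

print(latin8_len, utf8_len, latin1_len)
```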

ShadowRanger Over a year ago

@hjalpmig: Really though, the best solution is to just use portable UTF-8 (or on Windows, UTF-16) and compress the text with a dedicated compression scheme (e.g. gzip) if space is a big issue. Fiddling around the margins by shaving a byte here and there is not compression.
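A minimal sketch of that suggestion, using the standard-library `gzip` module on UTF-8 bytes. The sample text is the snippet from the question repeated many times, so the ratio it prints is far better than real prose would give; it only shows the mechanics.

```python
import gzip

# Repeated sample text; real text compresses less dramatically.
sample = "Lle wyt ti heddiw?\nDdoe es i at gogledd Nghymru.\n" * 200

raw = sample.encode("utf-8")
compressed = gzip.compress(raw)

# Decompressing and decoding gives back the original text unchanged.
restored = gzip.decompress(compressed).decode("utf-8")

print(len(raw), len(compressed))
```

For files on disk, `gzip.open(path, "wt", encoding="utf-8")` can be used in place of the plain `open` call to write compressed text directly.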


aaro · Accepted Answer · 2016-04-18 16:13:49Z

0

"UTF-8 uses one byte for any ASCII character, all of which have the same code values in both UTF-8 and ASCII encoding, and up to four bytes for other characters." - https://en.wikipedia.org/wiki/Unicode

