For university, I'm researching compression techniques. One experiment I'm trying to perform is replacing certain Welsh letters that are written as digraphs with a single character.
I expected that replacing two characters with one would reduce the file size (however marginally) or, at worst, leave it unchanged. I wrote a Python script to do this, but it actually increases the file size. The original file I tested it on was ~74,400KB, and the output file was ~74,700KB.
Here is my Python code:
replacements = {
    'ch': 'ƒ',
    'Ch': '†',
    'CH': '‡',
    'dd': 'Œ',
    'Dd': '•',
    'DD': 'œ',
    'ff': '¤',
    'Ff': '¦',
    'FF': '§',
    'ng': '±',
    'Ng': 'µ',
    'NG': '¶',
    'll': 'º',
    'Ll': '¿',
    'LL': 'Æ',
    'ph': 'Ç',
    'Ph': 'Ð',
    'PH': '×',
    'rh': 'Ø',
    'Rh': 'Þ',
    'RH': 'ß',
    'th': 'æ',
    'Th': 'ç',
    'TH': 'ð',
}
print("Input file location: ")
inLoc = input("> ")
print("Output file location: ")
outLoc = input("> ")
with open(inLoc, "r", encoding="Latin-1") as infile, open(outLoc, "w", encoding="utf-8") as outfile:
    for line in infile:
        for src, target in replacements.items():
            line = line.replace(src, target)
        outfile.write(line)
When I tested it on a small text file only a few lines long, the output was as expected.
Input.txt:
Lle wyt ti heddiw?
Ddoe es i at gogledd Nghymru.
Output.txt:
¿e wyt ti heŒiw?
•oe es i at gogleŒ µhymru.
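A quick way to test the assumption that each replacement takes less space than its digraph is to compare encoded byte lengths rather than character counts. The sketch below is not part of the script above: `encoded_size` is a hypothetical helper, and the four pairs are just a sample taken from the `replacements` table. It assumes the output encoding is UTF-8, as in the script.

```python
# Hypothetical helper (not in the original script): count the bytes a string
# occupies once it is encoded, which is what determines the size on disk.
def encoded_size(text, encoding="utf-8"):
    """Return the number of bytes `text` occupies in `encoding`."""
    return len(text.encode(encoding))


# A small sample of pairs copied from the replacements table above.
samples = {"ch": "ƒ", "Ch": "†", "dd": "Œ", "ll": "º"}

for src, target in samples.items():
    # Print (bytes for the digraph, bytes for its replacement) in UTF-8.
    print(src, (encoded_size(src), encoded_size(target)))
```

If the second number in a pair is not smaller than the first, that substitution cannot shrink a UTF-8 output file, even though it reduces the character count.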
Can anyone explain why the output file ends up larger than the input?