I'm writing a program in Python. My goal is to:
- read the input XML file one line at a time
- for each line, find the "CH" attribute
- change the attribute value: translate it from French to Portuguese
- write the changed line to the output XML file
- as I manipulate text in various languages, I'd like to keep UTF-8 encoding so that foreign special characters are displayed correctly in the output file
My code:
import os
import xml.etree.ElementTree as ET
from googletrans import Translator
with open("input file.txt", "r", encoding='utf-8') as input_file:
    with open("output file.txt", "w", encoding='utf-8') as output_file:
        # Read the input file one line at a time
        for ligne in input_file:
            # Parse the line as an XML element
            root = ET.fromstring(ligne)
            # Change the CH attribute value: translate from French (fr) to Portuguese (pt)
            current_text = root.get("CH")
            translator = Translator()
            translated_text = translator.translate(dest="pt", src="fr", text=current_text)
            root.attrib["CH"] = translated_text.text
            # Serialize the element (bytes) and convert the bytes to a string
            decoded_string = ET.tostring(root).decode("utf-8")
            # Write the line to the output file
            output_file.write(decoded_string)
The problem is that in the output file I get characters that are not encoded properly, for example with the input file below:
<?xml version="1.0" encoding="UTF-8"?>
<SCRIBUSUTF8NEW Version="1.5.5">
<StoryText>
<DefaultStyle ALIGN="1" SCALEV="100" BASEO="0" KERN="0"/>
<ITEXT FONT="Times New Roman Bold" BASEO="0" KERN="0" CH="la victoire est à nous"/>
<para ALIGN="1" LINESP="10"/>
<ITEXT FONT="Times New Roman Bold" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="vive l'empereur"/>
<trail ALIGN="1" LINESP="10"/>
</StoryText>
</SCRIBUSUTF8NEW>
I get this result:
<?xml version="1.0" encoding="UTF-8"?>
<SCRIBUSUTF8NEW Version="1.5.5">
<StoryText>
<DefaultStyle ALIGN="1" SCALEV="100" BASEO="0" KERN="0"/>
<ITEXT FONT="Times New Roman Bold" BASEO="0" KERN="0" CH="A vit&#243;ria &#233; nossa" />
<para ALIGN="1" LINESP="10"/>
<ITEXT FONT="Times New Roman Bold" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="Vida longa ao" />
<trail ALIGN="1" LINESP="10"/>
</StoryText>
</SCRIBUSUTF8NEW>
instead of the expected result:
<?xml version="1.0" encoding="UTF-8"?>
<SCRIBUSUTF8NEW Version="1.5.5">
<StoryText>
<DefaultStyle ALIGN="1" SCALEV="100" BASEO="0" KERN="0"/>
<ITEXT FONT="Times New Roman Bold" BASEO="0" KERN="0" CH="A vitória é nossa" />
<para ALIGN="1" LINESP="10"/>
<ITEXT FONT="Times New Roman Bold" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="Vida longa ao" />
<trail ALIGN="1" LINESP="10"/>
</StoryText>
</SCRIBUSUTF8NEW>
I have checked by printing it that translated_text.text is correct ("A vitória é nossa"), but the value of decoded_string is wrong despite the UTF-8 encoding I specified: <ITEXT FONT="Times New Roman Bold" BASEO="0" KERN="0" CH="A vit&#243;ria &#233; nossa" />
I do not understand why I get this result. Could you please help me?
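For reference, here is a minimal sketch that isolates the serialization step without googletrans. The hard-coded Portuguese string is only a stand-in for translated_text.text, and the second call with encoding="unicode" is there purely for comparison; it is not part of my code above.

```python
import xml.etree.ElementTree as ET

# One line taken from the input file above
line = '<ITEXT FONT="Times New Roman Bold" BASEO="0" KERN="0" CH="la victoire est à nous"/>'
root = ET.fromstring(line)

# Stand-in for translated_text.text (assumption: no call to the translator here)
root.set("CH", "A vitória é nossa")

# Same call as in my code: tostring() with its default encoding, then decode
default_output = ET.tostring(root).decode("utf-8")

# For comparison: ask tostring() for a str instead of bytes
unicode_output = ET.tostring(root, encoding="unicode")

print(default_output)   # non-ASCII characters appear as numeric character references
print(unicode_output)   # non-ASCII characters are kept as-is
```

As far as I can tell from the documentation, ET.tostring() defaults to encoding="us-ascii", which would explain why the accented characters come out as references such as &#243; even though the output file itself is opened with UTF-8.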