I'm writing a program in Python. My goal is to:

  • read the input XML file one line at a time
  • for each line, find the "CH" attribute
  • change the attribute value: translate it from French to Portuguese
  • write the changed line to the output XML file
  • since I handle text in various languages, keep UTF-8 encoding so that special characters are displayed correctly in the output file

My code:

import os
import xml.etree.ElementTree as ET
from googletrans import Translator

with open("input file.txt", "r", encoding='utf-8') as input_file:
    with open("output file.txt", "w", encoding='utf-8') as output_file:
        # Read input file
        for ligne in input_file:
            # Parse the line
            root = ET.fromstring(ligne)

            # Change CH attribute value: translate from French (fr) to Portuguese (pt)
            current_text = root.get("CH")
            translator = Translator()
            translated_text = translator.translate(dest="pt", src="fr", text=current_text)
            root.attrib["CH"] = translated_text.text

            # Convert bytes to string
            decoded_string = ET.tostring(root).decode("utf-8")

            # Write output file
            output_file.write(decoded_string)

The problem is that in the output file the special characters are escaped instead of being written as UTF-8. For example, with the input file below:

<?xml version="1.0" encoding="UTF-8"?>
<SCRIBUSUTF8NEW Version="1.5.5">
        <StoryText>
                <DefaultStyle ALIGN="1" SCALEV="100" BASEO="0" KERN="0"/>
                <ITEXT FONT="Times New Roman Bold" BASEO="0" KERN="0" CH="la victoire est à nous"/>
                <para ALIGN="1" LINESP="10"/>
                <ITEXT FONT="Times New Roman Bold" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="vive l'empereur"/>
                <trail ALIGN="1" LINESP="10"/>
        </StoryText>
</SCRIBUSUTF8NEW>

I get this result:

<?xml version="1.0" encoding="UTF-8"?>
<SCRIBUSUTF8NEW Version="1.5.5">
        <StoryText>
                <DefaultStyle ALIGN="1" SCALEV="100" BASEO="0" KERN="0"/>
                <ITEXT FONT="Times New Roman Bold" BASEO="0" KERN="0" CH="A vit&#243;ria &#233; nossa" />                                          
                <para ALIGN="1" LINESP="10"/>
                <ITEXT FONT="Times New Roman Bold" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="Vida longa ao" />                 
                <trail ALIGN="1" LINESP="10"/>
        </StoryText>
</SCRIBUSUTF8NEW>

instead of the expected result:


<?xml version="1.0" encoding="UTF-8"?>
<SCRIBUSUTF8NEW Version="1.5.5">
        <StoryText>
                <DefaultStyle ALIGN="1" SCALEV="100" BASEO="0" KERN="0"/>
                <ITEXT FONT="Times New Roman Bold" BASEO="0" KERN="0" CH="A vitória é nossa" /> 
                <para ALIGN="1" LINESP="10"/>
                <ITEXT FONT="Times New Roman Bold" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="Vida longa ao" />                 
                <trail ALIGN="1" LINESP="10"/>
        </StoryText>
</SCRIBUSUTF8NEW>

I have checked by printing it that translated_text.text is correct ("A vitória é nossa"), but the value of decoded_string is wrong despite the UTF-8 encoding specification: <ITEXT FONT="Times New Roman Bold" BASEO="0" KERN="0" CH="A vit&#243;ria &#233; nossa" />

I do not understand why I get this result. Could you please help me?

1 Answer

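Parse the whole tree and iterate over the ITEXT nodes instead of parsing line by line. The escaped characters come from ET.tostring(), which serializes to US-ASCII by default and therefore writes non-ASCII characters as numeric character references such as &#243;. The following demonstrates how to change the text and write the modified tree with the .write() method, requesting an XML declaration and declaring the encoding: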

# pip install googletrans==4.0.0rc1
# Note 3.0.0 didn't work
import xml.etree.ElementTree as ET
import googletrans as gt

tree = ET.parse('input file.txt')
translator = gt.Translator()
for itext in tree.iterfind('*/ITEXT'):
    current_text = itext.get('CH')
    itext.attrib['CH'] = translator.translate(dest="pt", src="fr", text=current_text).text
tree.write('output file.txt', xml_declaration=True, encoding='UTF-8')

output file.txt

<?xml version='1.0' encoding='UTF-8'?>
<SCRIBUSUTF8NEW Version="1.5.5">
        <StoryText>
                <DefaultStyle ALIGN="1" SCALEV="100" BASEO="0" KERN="0" />
                <ITEXT FONT="Times New Roman Bold" BASEO="0" KERN="0" CH="A vitória é nossa" />
                <para ALIGN="1" LINESP="10" />
                <ITEXT FONT="Times New Roman Bold" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="Vida longa ao" />
                <trail ALIGN="1" LINESP="10" />
        </StoryText>
</SCRIBUSUTF8NEW>
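
To check the serialization behaviour in isolation, without calling the translation service, here is a minimal sketch. It uses only the standard library; the single ITEXT element is a made-up sample whose CH value is copied from the question, and the variable names are illustrative. It compares ET.tostring() with its default encoding against encoding="unicode":

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element sample; the CH value is the translated text from the question.
elem = ET.Element("ITEXT", {"CH": "A vitória é nossa"})

# Default call: returns bytes encoded as US-ASCII, so characters outside
# ASCII are written as numeric character references.
ascii_bytes = ET.tostring(elem)

# encoding="unicode": returns a str and leaves the characters as they are.
as_text = ET.tostring(elem, encoding="unicode")

print(ascii_bytes.decode("ascii"))
print(as_text)
```

The first line printed should contain the &#243; and &#233; references seen in the question, and the second the plain accented characters. If the line-by-line approach has to be kept for some reason, passing encoding="unicode" to ET.tostring() and dropping the .decode() call is one possible fix, although lines that are not complete elements would still fail to parse.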