I have an XML file structured like this:
<?xml version="1.0" encoding="utf-8" ?>
<pages>
<page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
<textbox id="0" bbox="179.739,592.028,261.007,604.510">
<textline bbox="179.739,592.028,261.007,604.510">
<text font="NUMPTY+ImprintMTnum" ncolour="0" size="12.482">C</text>
<text font="NUMPTY+ImprintMTnum-it" ncolour="0" size="12.333">A</text>
<text font="NUMPTY+ImprintMTnum-it" ncolour="0" size="12.333">P</text>
<text font="NUMPTY+ImprintMTnum-it" ncolour="0" size="12.333">I</text>
<text font="NUMPTY+ImprintMTnum" ncolour="0" size="12.482">T</text>
<text font="NUMPTY+ImprintMTnum" ncolour="0" size="12.482">O</text>
<text font="NUMPTY+ImprintMTnum" ncolour="0" size="12.482">L</text>
<text font="NUMPTY+ImprintMTnum" ncolour="0" size="12.482">O</text>
<text> </text>
<text font="NUMPTY+ImprintMTnum" ncolour="0" size="12.482">I</text>
<text font="NUMPTY+ImprintMTnum" ncolour="0" size="12.482">I</text>
<text font="NUMPTY+ImprintMTnum" ncolour="0" size="12.482">I</text>
<text>
</text>
</textline>
</textbox>
</page>
</pages>
Note that the real file is much bigger than this and contains many pages. There are two kinds of <text> elements, differing in font and font size. I want to merge consecutive letters that share the same font and size into a single element, keeping those attributes, so the output would look like this:
<?xml version="1.0" encoding="utf-8" ?>
<pages>
<page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
<textbox id="0" bbox="179.739,592.028,261.007,604.510">
<textline bbox="179.739,592.028,261.007,604.510">
<text font="NUMPTY+ImprintMTnum" ncolour="0" size="12.482">C</text>
<text font="NUMPTY+ImprintMTnum-it" ncolour="0" size="12.333">API</text>
<text font="NUMPTY+ImprintMTnum" ncolour="0" size="12.482">TOLO III</text>
</textline>
</textbox>
</page>
</pages>
The important thing is that the original order of the letters is preserved, along with the attributes (so I know the font and font size of each run). My code so far looks like this:
import xml.etree.ElementTree as ET

MY_XML = ET.parse('fe.xml')
textlines = MY_XML.findall("./page/textbox/textline")
for textline in textlines:
    fulltext = []
    for text_elem in list(textline):
        # Get the text of each 'text' element and then remove it
        fulltext.append(text_elem.text)
        textline.remove(text_elem)
    # Create a new 'text' element and add the joined letters to it
    new_text_elem = ET.Element("text", font="NUMPTY+ImprintMTnum", ncolour="0", size="12.482")
    new_text_elem.text = "".join(fulltext).strip()
    # Append the new 'text' element to its parent
    textline.append(new_text_elem)
print(ET.tostring(MY_XML.getroot(), encoding="unicode"))
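But it only works for one kind of tag: every letter in a line ends up in a single element with the hard-coded font and size. I think I need a condition inside the for loop that checks the attributes of each element, but I haven't found anything on the web on how to do it. How can I handle the other font as well? Many thanks.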
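One possible way to express that condition, as a rough sketch rather than a definitive solution: walk the children of each textline in order and start a new run whenever the attributes change. The names merge_runs and SAMPLE are hypothetical and only there for illustration. The sketch assumes that a textline contains only <text> children, that a <text> without attributes (the space and the newline) belongs to the run before it, and that trailing whitespace at the end of a line can be dropped.

```python
import xml.etree.ElementTree as ET

# Assumption: a trimmed copy of the sample from the question, inlined so the
# sketch runs without the real 'fe.xml'.
SAMPLE = """<pages>
<page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
<textbox id="0" bbox="179.739,592.028,261.007,604.510">
<textline bbox="179.739,592.028,261.007,604.510">
<text font="NUMPTY+ImprintMTnum" ncolour="0" size="12.482">C</text>
<text font="NUMPTY+ImprintMTnum-it" ncolour="0" size="12.333">A</text>
<text font="NUMPTY+ImprintMTnum-it" ncolour="0" size="12.333">P</text>
<text font="NUMPTY+ImprintMTnum-it" ncolour="0" size="12.333">I</text>
<text font="NUMPTY+ImprintMTnum" ncolour="0" size="12.482">T</text>
<text font="NUMPTY+ImprintMTnum" ncolour="0" size="12.482">O</text>
<text font="NUMPTY+ImprintMTnum" ncolour="0" size="12.482">L</text>
<text font="NUMPTY+ImprintMTnum" ncolour="0" size="12.482">O</text>
<text> </text>
<text font="NUMPTY+ImprintMTnum" ncolour="0" size="12.482">I</text>
<text font="NUMPTY+ImprintMTnum" ncolour="0" size="12.482">I</text>
<text font="NUMPTY+ImprintMTnum" ncolour="0" size="12.482">I</text>
<text>
</text>
</textline>
</textbox>
</page>
</pages>"""


def merge_runs(textline):
    """Merge consecutive children of a textline that share the same attributes."""
    runs = []  # [attributes, list of text pieces] pairs, in document order
    for elem in list(textline):  # iterate over a copy, since we remove as we go
        attrs = dict(elem.attrib)
        piece = elem.text or ""
        if runs and (not attrs or attrs == runs[-1][0]):
            # Same font/size as the current run, or a bare <text> with no
            # attributes (space, newline): extend the current run.
            runs[-1][1].append(piece)
        else:
            # Attributes changed: start a new run.
            runs.append([attrs, [piece]])
        textline.remove(elem)

    for index, (attrs, pieces) in enumerate(runs):
        merged = "".join(pieces)
        if index == len(runs) - 1:
            merged = merged.rstrip()  # drop the newline that ends the line
        if not merged:
            continue  # nothing left to write for this run
        new_elem = ET.SubElement(textline, "text", attrs)
        new_elem.text = merged


root = ET.fromstring(SAMPLE)
for textline in root.iterfind("./page/textbox/textline"):
    merge_runs(textline)

for text_elem in root.iter("text"):
    print(text_elem.get("font"), text_elem.get("size"), repr(text_elem.text))
```

On the sample this leaves three <text> elements in the original order, holding C, API and TOLO III. With the real file, ET.parse('fe.xml') and findall on the tree would replace the inlined string. The new elements carry no trailing whitespace, so the serialized output comes out on one line per textline; on Python 3.9 or later, ET.indent can be used to pretty-print it.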