I have an XML file structured like this:

<?xml version="1.0" encoding="utf-8" ?>
    <pages>
    <page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
    <textbox id="0" bbox="179.739,592.028,261.007,604.510">
    <textline bbox="179.739,592.028,261.007,604.510">
    <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">C</text>
    <text font="NUMPTY+ImprintMTnum-it"  ncolour="0" size="12.333">A</text>
    <text font="NUMPTY+ImprintMTnum-it"  ncolour="0" size="12.333">P</text>
    <text font="NUMPTY+ImprintMTnum-it"  ncolour="0" size="12.333">I</text>
    <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">T</text>
    <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">O</text>
    <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">L</text>
    <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">O</text>
    <text> </text>
    <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">I</text>
    <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">I</text>
    <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">I</text>
    <text>
    </text>
    </textline>
    </textbox>
</page>
</pages>

Note that the real file is much bigger than this and contains many pages. There are two kinds of <text> elements, differing in their font and size attributes. I want to merge consecutive letters that share the same attributes, so the output should keep the font and font size but combine whatever can be combined, like this:

<?xml version="1.0" encoding="utf-8" ?>
        <pages>
        <page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
        <textbox id="0" bbox="179.739,592.028,261.007,604.510">
        <textline bbox="179.739,592.028,261.007,604.510">
        <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">C</text>
        <text font="NUMPTY+ImprintMTnum-it"  ncolour="0" size="12.333">API</text>
        <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">TOLO III</text>
        </textline>
        </textbox>
    </page>
    </pages>

The important thing is that the original order of the letters is kept, along with the attributes (so I know the font size of each piece of text). My code so far looks like this:

import xml.etree.ElementTree as ET

MY_XML = ET.parse('fe.xml')

textlines = MY_XML.findall("./page/textbox/textline")

for textline in textlines:
    fulltext = []
    for text_elem in list(textline):
        # Get the text of each 'text' element and then remove it
        fulltext.append(text_elem.text)
        textline.remove(text_elem)

    # Create a new 'text' element and add the joined letters to it
    new_text_elem = ET.Element("text", font="NUMPTY+ImprintMTnum", ncolour="0", size="12.482")
    new_text_elem.text = "".join(fulltext).strip()

    # Append the new 'text' element to its parent
    textline.append(new_text_elem)

print(ET.tostring(MY_XML.getroot(), encoding="unicode"))

But this only works for one font/size combination. I think I need a condition so that the loop checks the attributes of every element, but I haven't found any information on how to do that. How can I handle the other combination too? Many thanks.

2 Answers

1

Why is new_text_elem a hardcoded Element with fixed attributes? You don't know in advance which attributes to assign.

Try the following. Create another inner for loop that writes ALL attributes of the current element to a dictionary. You can iterate over the attributes as well.

For the next element, check whether all of its attributes are in the dictionary and have the same values. Read about dictionary comparison, or just iterate over the keys and compare the values with ==.

If they are the same, add the element to a list of identical elements found so far. Then check the next element.

If they are not the same, write all elements in the list out as one new element, combining their text. Then clear the list and start over.

Summary:

  • Iterate through the <text> elements.
  • Store all consecutive <text> elements that have the same attributes in a list.
  • Storing them one after another in a list preserves their order.
  • Once you encounter the first <text> element with different attributes, write out the stored ones first as a single element, concatenating their text and reusing their stored attribute names and values.
  • Clear the list and repeat.
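
The steps above can be sketched as follows. This is a minimal sketch rather than a definitive implementation. The helper name merge_runs, the shortened font names in SAMPLE, and the rule that a <text> without attributes (a space or line break) joins the run before it are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the question's file (font names are placeholders).
SAMPLE = """<pages>
  <page id="1">
    <textbox id="0">
      <textline>
        <text font="F" size="12.482">C</text>
        <text font="F-it" size="12.333">A</text>
        <text font="F-it" size="12.333">P</text>
        <text font="F" size="12.482">T</text>
        <text> </text>
        <text font="F" size="12.482">I</text>
        <text>
</text>
      </textline>
    </textbox>
  </page>
</pages>"""


def merge_runs(textline):
    """Merge consecutive <text> children that share the same attributes."""
    runs = []  # (attributes, [text pieces]) in document order
    for elem in list(textline):  # iterate over a copy, since we remove items
        if elem.tag != "text":
            continue
        attrs = dict(elem.attrib)
        piece = elem.text or ""
        # Assumption: an attribute-less <text> continues the preceding run.
        if runs and (not attrs or attrs == runs[-1][0]):
            runs[-1][1].append(piece)
        else:
            runs.append((attrs, [piece]))
        textline.remove(elem)

    for attrs, pieces in runs:
        # Mirrors the .strip() in the question's code. Note that it also
        # drops a space that happens to sit at the edge of a run.
        merged = "".join(pieces).strip()
        if not merged:
            continue  # skip runs that contain only whitespace
        new_elem = ET.SubElement(textline, "text", attrs)
        new_elem.text = merged


root = ET.fromstring(SAMPLE)
for textline in root.iter("textline"):
    merge_runs(textline)

merged_texts = [(t.get("font"), t.text) for t in root.iter("text")]
print(merged_texts)
```

With the sample input this prints [('F', 'C'), ('F-it', 'AP'), ('F', 'T I')]. Because the attributes are copied from the first element of each run, nothing about the font or size has to be hardcoded.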

5 Comments

Thank you for your answer, but it's not really clear to me... what do you mean by "for the next element"? I want the text to stay in the same order as I find it, and I don't understand how that can be guaranteed if I'm just comparing tags (if I understood it correctly).
I added a list at the bottom of the answer.
Thank you again, but I don't know how to compare XML tags... I have looked on the internet and found nothing
Do you know how to read them? They are called attributes in XML. And once you know how to read them you can compare them like is_same = "12.482" == some_attribute
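
As a small illustration of that comparison, here is a sketch using only xml.etree.ElementTree. The three elements are copied from the question, while the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

a = ET.fromstring('<text font="NUMPTY+ImprintMTnum" ncolour="0" size="12.482">T</text>')
b = ET.fromstring('<text font="NUMPTY+ImprintMTnum" ncolour="0" size="12.482">O</text>')
c = ET.fromstring('<text font="NUMPTY+ImprintMTnum-it" ncolour="0" size="12.333">A</text>')

# Read a single attribute; .get() returns None if the attribute is missing.
size_matches = a.get("size") == "12.482"

# .attrib is a plain dict, so == compares every attribute at once.
same_ab = a.attrib == b.attrib
same_ac = a.attrib == c.attrib

print(size_matches, same_ab, same_ac)  # True True False
```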
1

Below is the core logic you need to implement.

It is not an end-to-end solution, but it takes care of the core logic.

import xml.etree.ElementTree as ET


xml = '''<?xml version="1.0" encoding="utf-8" ?>
<pages>
    <page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
        <textbox id="0" bbox="179.739,592.028,261.007,604.510">
            <textline bbox="179.739,592.028,261.007,604.510">
                <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">C</text>
                <text font="NUMPTY+ImprintMTnum-it"  ncolour="0" size="12.333">A</text>
                <text font="NUMPTY+ImprintMTnum-it"  ncolour="0" size="12.333">P</text>
                <text font="NUMPTY+ImprintMTnum-it"  ncolour="0" size="12.333">I</text>
                <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">T</text>
                <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">O</text>
                <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">L</text>
                <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">O</text>
                <text> </text>
                <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">I</text>
                <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">I</text>
                <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">I</text>
                <text>
                </text>
            </textline>
        </textbox>
    </page>
</pages>'''
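words = []  # one list of letters per run of identical (font, size)
root = ET.fromstring(xml)
pages = root.findall('.//page')
for page in pages:
    previous_key = None
    current_key = None
    texts = page.findall('.//text')
    for txt in texts:
        # The key is (font, size). A <text> without these attributes (a space
        # or line break) inherits the key of the element before it.
        if previous_key:
            current_key = (txt.attrib.get('font', previous_key[0]), txt.attrib.get('size', previous_key[1]))
        else:
            current_key = (txt.attrib.get('font', 'empty'), txt.attrib.get('size', 'empty'))
        # A change of key starts a new group.
        if current_key != previous_key:
            words.append([])
        # txt.text is None for an empty element, which ''.join() rejects.
        words[-1].append(txt.text or '')
        previous_key = current_key

# Print each group as one string, in document order.
for group in words:
    if group:
        print(''.join(group))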

output

C
API
TOLO III
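
If the font and size are needed alongside the text, one possible variation is to store the key together with each group. This is a hedged sketch of that idea, not part of the code above: the runs list, the shortened snippet, and the placeholder font names Roman and Italic are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

snippet = '''<textline>
<text font="Roman" size="12.482">C</text>
<text font="Italic" size="12.333">A</text>
<text font="Italic" size="12.333">P</text>
<text font="Roman" size="12.482">T</text>
<text> </text>
<text font="Roman" size="12.482">I</text>
</textline>'''

runs = []  # each entry: [(font, size), [letters]]
previous_key = None
for txt in ET.fromstring(snippet).iter('text'):
    # Elements without font/size inherit the key of the previous element.
    fallback = previous_key or ('empty', 'empty')
    key = (txt.get('font', fallback[0]), txt.get('size', fallback[1]))
    if key != previous_key:
        runs.append([key, []])
    runs[-1][1].append(txt.text or '')
    previous_key = key

for (font, size), letters in runs:
    print(font, size, repr(''.join(letters)))
```

Each printed line then carries the font and size of the run it belongs to, in the original order.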

3 Comments

Thank you very much! The output I get isn't the same as the one you posted, though, and I wonder why. But it really helps.
also, it doesn't keep the tags, which are important to me because I need to know the font size of the text
My answer includes only the core logic. I thought you would be able to take it from here.
