How to join XML tags under condition (Python)?

Question

I have an XML file structured like this:

<?xml version="1.0" encoding="utf-8" ?>
    <pages>
    <page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
    <textbox id="0" bbox="179.739,592.028,261.007,604.510">
    <textline bbox="179.739,592.028,261.007,604.510">
    <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">C</text>
    <text font="NUMPTY+ImprintMTnum-it"  ncolour="0" size="12.333">A</text>
    <text font="NUMPTY+ImprintMTnum-it"  ncolour="0" size="12.333">P</text>
    <text font="NUMPTY+ImprintMTnum-it"  ncolour="0" size="12.333">I</text>
    <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">T</text>
    <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">O</text>
    <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">L</text>
    <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">O</text>
    <text> </text>
    <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">I</text>
    <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">I</text>
    <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">I</text>
    <text>
    </text>
    </textline>
    </textbox>
</page>
</pages>

Note that the real file is much bigger than this and contains many pages. The <text> elements come in two kinds, differing in font and font size. I want to merge consecutive letters whose attributes match, so the output keeps the font and font size but combines whatever can be combined, like:

<?xml version="1.0" encoding="utf-8" ?>
        <pages>
        <page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
        <textbox id="0" bbox="179.739,592.028,261.007,604.510">
        <textline bbox="179.739,592.028,261.007,604.510">
        <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">C</text>
        <text font="NUMPTY+ImprintMTnum-it"  ncolour="0" size="12.333">API</text>
        <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">TOLO III</text>
        </textline>
        </textbox>
    </page>
    </pages>

The important thing is that the original order of the letters is kept, along with the attributes (so I know the font size of each piece of text). My code so far looks like this:

import xml.etree.ElementTree as ET

MY_XML = ET.parse('fe.xml')

textlines = MY_XML.findall("./page/textbox/textline")

for textline in textlines:
    fulltext = []
    for text_elem in list(textline):
        # Get the text of each 'text' element and then remove it
        fulltext.append(text_elem.text)
        textline.remove(text_elem)

    # Create a new 'text' element and add the joined letters to it
    new_text_elem = ET.Element("text", font="NUMPTY+ImprintMTnum", ncolour="0", size="12.482")
    new_text_elem.text = "".join(fulltext).strip()

    # Append the new 'text' element to its parent
    textline.append(new_text_elem)

print(ET.tostring(MY_XML.getroot(), encoding="unicode"))

But it only works for one kind of tag. I think I need a condition so that the for loop checks the attributes of every <text> element, but I haven't found information on the web on how to do it. How can I include the other kind? Many thanks

Joe · Accepted Answer · 2020-04-11 15:09:29Z

1

Why is new_text_elem a hardcoded Element with fixed attributes? At that point you don't know which attributes to assign.

Try the following. In an inner for loop, store ALL attributes of the current <text> element in a dictionary (in ElementTree, elem.attrib already is one). You can iterate over the attributes as well.

For the next element, check whether all of its attributes are in the dictionary and have the same values. Read about dictionary comparison, or just iterate over the keys and compare the values with ==.

If they are the same, add the element to a list of identical elements found so far. Then check the next element.

If they are not the same, write the elements collected in the list as one new element, combining their text. Then clear the list and start over.

Summary:

Iterate through the <text> elements.
Store all consecutive <text> elements that have the same attributes in a list.
Storing them one after another in a list preserves the order.
Once you encounter the first <text> element with different attributes, write out the stored ones first, concatenating their text and reusing their stored attribute names and values.
Clear the list and repeat.
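
The steps above can be sketched roughly as follows. This is one possible reading of them, not a tested end-to-end solution: the function name merge_text_runs is made up for illustration, and it assumes that a <text> element without attributes (a space or line break) belongs to the run before it, and that a <textline> contains only <text> children.

```python
import xml.etree.ElementTree as ET


def merge_text_runs(textline):
    """Merge consecutive <text> children that share the same attributes."""
    runs = []  # (attributes, [text pieces]) in document order
    for elem in list(textline):
        if elem.tag != "text":
            continue
        attrs = dict(elem.attrib)
        piece = elem.text or ""
        # Assumption: an attribute-less <text> joins the preceding run.
        if runs and (not attrs or attrs == runs[-1][0]):
            runs[-1][1].append(piece)
        else:
            runs.append((attrs, [piece]))
        textline.remove(elem)

    for index, (attrs, pieces) in enumerate(runs):
        merged = "".join(pieces)
        if index == len(runs) - 1:
            # Drop the line break that ends the line.
            merged = merged.rstrip()
        if not merged:
            continue
        new_elem = ET.SubElement(textline, "text", attrs)
        new_elem.text = merged
        new_elem.tail = "\n"


# Small demo on a shortened version of the sample from the question.
root = ET.fromstring(
    '<pages><page><textbox><textline>'
    '<text font="NUMPTY+ImprintMTnum" size="12.482">C</text>'
    '<text font="NUMPTY+ImprintMTnum-it" size="12.333">A</text>'
    '<text font="NUMPTY+ImprintMTnum-it" size="12.333">P</text>'
    '<text font="NUMPTY+ImprintMTnum" size="12.482">T</text>'
    '<text> </text>'
    '<text font="NUMPTY+ImprintMTnum" size="12.482">I</text>'
    '<text>\n</text>'
    '</textline></textbox></page></pages>'
)
for line in root.findall("./page/textbox/textline"):
    merge_text_runs(line)
print(ET.tostring(root, encoding="unicode"))
```

Because the attributes are copied from the first element of each run, nothing about the font or size has to be hardcoded, and the runs are written back in the order they were read.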

edited Apr 11, 2020 at 15:09

answered Apr 11, 2020 at 10:43

Joe


5 Comments

Anna Over a year ago

Thank you for your answer, but it's not really clear to me... what do you mean by "for the next element"? I want the text to stay in the same order as I find it, and I don't understand how that can be guaranteed if I'm just comparing tags (if I understood correctly).

Joe Over a year ago

I added a list at the bottom of the answer.

Anna Over a year ago

Thank you again, but I don't know how to compare XML tags... I have looked on the internet and found nothing

Joe Over a year ago

Do you know how to read them? They are called attributes in XML. And once you know how to read them you can compare them like is_same = "12.482" == some_attribute

Joe Over a year ago

docs.python.org/3.4/library/xml.etree.elementtree.html Look for child.attrib
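
As a small illustration of the comments above, this is roughly how attributes can be read and compared with ElementTree. The two elements and their shortened font names are made up for the example; attrib, get and dictionary comparison with == are standard.

```python
import xml.etree.ElementTree as ET

# Two stand-alone <text> elements with made-up, shortened font names.
a = ET.fromstring('<text font="F1" size="12.482">C</text>')
b = ET.fromstring('<text font="F1-it" size="12.333">A</text>')

print(a.attrib)       # all attributes of the element, as a dictionary
print(a.get("size"))  # a single attribute value, as a string

# Comparing the two dictionaries compares every attribute name and value.
same_style = a.attrib == b.attrib
print(same_style)
```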

balderman · Accepted Answer · 2020-04-12 08:52:12Z

1

Below is the core logic you need to implement.

It is not an end-to-end solution: it only groups the letters and prints them, and does not write the merged elements back to the XML.

import xml.etree.ElementTree as ET


xml = '''<?xml version="1.0" encoding="utf-8" ?>
<pages>
    <page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
        <textbox id="0" bbox="179.739,592.028,261.007,604.510">
            <textline bbox="179.739,592.028,261.007,604.510">
                <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">C</text>
                <text font="NUMPTY+ImprintMTnum-it"  ncolour="0" size="12.333">A</text>
                <text font="NUMPTY+ImprintMTnum-it"  ncolour="0" size="12.333">P</text>
                <text font="NUMPTY+ImprintMTnum-it"  ncolour="0" size="12.333">I</text>
                <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">T</text>
                <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">O</text>
                <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">L</text>
                <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">O</text>
                <text> </text>
                <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">I</text>
                <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">I</text>
                <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">I</text>
                <text>
                </text>
            </textline>
        </textbox>
    </page>
</pages>'''
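words = []
root = ET.fromstring(xml)
pages = root.findall('.//page')
for page in pages:
    previous_key = None
    texts = page.findall('.//text')
    for txt in texts:
        # Key each <text> by (font, size). An element without these attributes
        # (a space or line break) inherits the key of the element before it.
        if previous_key:
            current_key = (txt.attrib.get('font', previous_key[0]),
                           txt.attrib.get('size', previous_key[1]))
        else:
            current_key = (txt.attrib.get('font', 'empty'),
                           txt.attrib.get('size', 'empty'))
        # A changed key starts a new group of letters.
        if current_key != previous_key:
            words.append([])
        # txt.text is None for an empty element, so fall back to ''.
        words[-1].append(txt.text or '')
        previous_key = current_key

# Print each group of letters as one string.
for group in words:
    if group:
        print(''.join(group))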

output

C
API
TOLO III

answered Apr 12, 2020 at 8:52

balderman


3 Comments

Anna Over a year ago

thank you very much! but the output I have isn't the same you posted, I wonder why! But it really helps

Anna Over a year ago

also, it doesn't keep the tags, which are important to me because I need to know the font size of the text

balderman Over a year ago

My answer includes only the core logic. I thought you would be able to take it from here.
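
To keep the font and size next to each merged string, as asked for in the comments, the grouping loop could store the key together with the letters. The following is only a sketch under the same assumption as the answer (an attribute-less <text> inherits the style of the element before it); the function name group_text_by_style is hypothetical.

```python
import xml.etree.ElementTree as ET


def group_text_by_style(page):
    """Return [(font, size, text), ...] for consecutive <text> runs."""
    groups = []  # each entry: [font, size, [text pieces]]
    for txt in page.iter("text"):
        font, size = txt.get("font"), txt.get("size")
        piece = txt.text or ""
        # Assumption: no font and no size means "same style as before".
        inherits = font is None and size is None
        if groups and (inherits or (font, size) == tuple(groups[-1][:2])):
            groups[-1][2].append(piece)
        else:
            groups.append([font, size, [piece]])

    result = []
    for font, size, pieces in groups:
        # Whitespace at the ends of a run is trimmed here for display;
        # drop the strip() if spaces at run boundaries matter.
        text = "".join(pieces).strip()
        if text:
            result.append((font, size, text))
    return result


# Small demo with made-up, shortened font names.
page = ET.fromstring(
    '<page><textbox><textline>'
    '<text font="F1" size="12.482">C</text>'
    '<text font="F1-it" size="12.333">A</text>'
    '<text font="F1-it" size="12.333">P</text>'
    '<text font="F1" size="12.482">T</text>'
    '<text> </text>'
    '<text font="F1" size="12.482">I</text>'
    '</textline></textbox></page>'
)
for font, size, text in group_text_by_style(page):
    print(font, size, text)
```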
