I have a long XML file structured like this:

<pages>
  <page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
    <textbox id="0" bbox="191.745,592.218,249.042,603.578">
      <textline bbox="191.745,592.218,249.042,603.578">
<new_line>
          <text font="QKWQNQ+ImprintMTnum-Bold" bbox="272.661,554.072,277.415,564.757" colourspace="DeviceGray" ncolour="0" size="10.685">1</text>
          <text font="NUMPTY+ImprintMTnum" bbox="280.592,553.628,285.109,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">s</text>
          <text font="NUMPTY+ImprintMTnum" bbox="284.964,553.628,290.760,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">v</text>
          <text font="NUMPTY+ImprintMTnum" bbox="290.382,553.628,295.477,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">e</text>
          <text font="NUMPTY+ImprintMTnum" bbox="295.333,553.628,301.707,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">n</text>
          <text font="NUMPTY+ImprintMTnum" bbox="301.563,553.628,305.390,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">t</text>
          <text font="NUMPTY+ImprintMTnum" bbox="305.245,553.628,311.620,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">u</text>
          <text font="NUMPTY+ImprintMTnum" bbox="311.475,553.628,315.992,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">r</text>
          <text font="NUMPTY+ImprintMTnum" bbox="315.847,553.628,320.942,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">a</text>
          <text font="NUMPTY+ImprintMTnum" bbox="320.798,553.628,324.625,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">]</text>
          <text font="NUMPTY+ImprintMTnum" bbox="324.480,553.628,327.384,566.110" colourspace="DeviceGray" ncolour="0" size="12.482"> </text>
          <text font="PYNIYO+ImprintMTnum-Italic" bbox="327.763,553.639,331.590,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">s</text>
          <text font="PYNIYO+ImprintMTnum-Italic" bbox="331.445,553.639,337.241,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">p</text>
          <text font="PYNIYO+ImprintMTnum-Italic" bbox="337.097,553.639,340.924,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">s</text>
          <text font="PYNIYO+ImprintMTnum-Italic" bbox="340.312,553.639,343.560,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">.</text>
          <text font="PYNIYO+ImprintMTnum-Italic" bbox="343.416,553.639,346.319,566.366" colourspace="DeviceGray" ncolour="0" size="12.727"> </text>
          <text font="PYNIYO+ImprintMTnum-Italic" bbox="346.709,553.639,352.505,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">a</text>
          <text font="NUMPTY+ImprintMTnum" bbox="355.660,553.628,365.283,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">m</text>
          <text font="NUMPTY+ImprintMTnum" bbox="365.139,553.628,368.387,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">i</text>
          <text font="NUMPTY+ImprintMTnum" bbox="368.242,553.628,372.759,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">-</text>
        </new_line>
</textline>
    </textbox>
</page>
</pages>

The actual XML is way longer and has more pages.

You can see that the "size" attribute takes different values. Within each <new_line> element, I want to join the letters of consecutive <text> elements that have the same size, keeping their order of appearance.

My expected output is an XML file:

<pages>
  <page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
    <textbox id="0" bbox="191.745,592.218,249.042,603.578">
      <textline bbox="191.745,592.218,249.042,603.578">
<new_line>
          <text font="QKWQNQ+ImprintMTnum-Bold" bbox="272.661,554.072,277.415,564.757" colourspace="DeviceGray" ncolour="0" size="10.685">1</text>
          <text font="NUMPTY+ImprintMTnum" bbox="280.592,553.628,285.109,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">sventura ] </text>
          <text font="PYNIYO+ImprintMTnum-Italic" bbox="327.763,553.639,331.590,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">sps. a</text> 
          <text font="NUMPTY+ImprintMTnum" bbox="355.660,553.628,365.283,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">mi-</text>
        </new_line>
</textline>
    </textbox>
</page>
</pages>

Importantly, the order of the characters has to be preserved. I have tried many approaches without success. How can I achieve the desired output?

EDIT: I tried to compare the attributes like this, but I need to keep the tag:

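    import xml.etree.ElementTree as ET

    # `xml` is assumed to hold the XML document as a string
    words = []
    root = ET.fromstring(xml)
    for page in root.findall('.//page'):
        previous_key = None
        for txt in page.findall('.//text'):
            # Key each character by (font, size); fall back to the previous key
            # when an attribute is missing
            if previous_key:
                current_key = (txt.attrib.get('font', previous_key[0]),
                               txt.attrib.get('size', previous_key[1]))
            else:
                current_key = (txt.attrib.get('font', 'empty'),
                               txt.attrib.get('size', 'empty'))
            # Start a new group whenever the key changes
            if current_key != previous_key:
                words.append([])
            words[-1].append(txt.text or '')
            previous_key = current_key

    # This only prints the joined strings; the <text> tags are lost
    for group in words:
        if group:
            print(''.join(group))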

3 Comments

  • Can you share what you've tried? It might be you're close and just need a nudge in the right direction or at least it'd show what doesn't work so others don't offer it as a potential answer. Commented Apr 16, 2020 at 16:30
  • Sure, I updated my question if it helps! Commented Apr 16, 2020 at 16:38
  • That's great - thanks. Commented Apr 16, 2020 at 17:11

1 Answer

You can try the following approach:

  • Iterate over all new_line elements. For each new_line:
    • Find all child text elements and save them in a list using findall.
    • Iterate over that list in (previous, current) pairs using zip (see this discussion for more details: zip(l[:-1], l[1:])).
    • Get the size of the current and previous elements.
    • If the sizes are equal and not None:
      • Get the current and previous text.
      • Store the concatenation (previous + current) in the current element.
      • Remove the previous element using remove.
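
To make the pairing step concrete, here is a minimal sketch of the zip(l[:-1], l[1:]) idiom on a plain list. The `sizes` list is made up for illustration (the values are borrowed from the sample XML) and stands in for the size attributes of consecutive text elements.

```python
# Hypothetical list of size attributes, in document order
sizes = ["10.685", "12.482", "12.482", "12.727", "12.482"]

# Pair every element with its predecessor: (previous, current)
pairs = list(zip(sizes[:-1], sizes[1:]))

# True where the current size equals the previous one, i.e. where a merge would happen
same_as_previous = [prev == curr for prev, curr in pairs]
print(same_as_previous)
```

Each pair only looks one step back, which is why the merged text has to be carried forward into the current element rather than the previous one.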

Code

import lxml.etree as etree

parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse('data.xml', parser)
root = tree.getroot()

# Iterate over every //new_line block
for new_line_block in tree.xpath('//new_line'):
    # Find all "text" elements in the new_line block
    list_text_elts = new_line_block.findall('text')

    # Iterate over them as (previous, current) pairs
    for previous_text, current_text in zip(list_text_elts[:-1], list_text_elts[1:]):
        # Get the size attributes
        prev_size = previous_text.attrib.get('size')
        curr_size = current_text.attrib.get('size')
        # If they are equal and not None
        if curr_size == prev_size and curr_size is not None:
            # Get current and previous text
            pt = previous_text.text if previous_text.text is not None else ""
            ct = current_text.text if current_text.text is not None else ""
            # Carry the accumulated text forward into the current element
            current_text.text = pt + ct
            # Remove the previous element
            previous_text.getparent().remove(previous_text)


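# Serialize the modified tree and write it to output.xml
newtree = etree.tostring(root, encoding='utf-8', pretty_print=True)
newtree = newtree.decode("utf-8")
with open('output.xml', 'w', encoding='utf-8') as f:
    f.write(newtree)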

output.xml

<pages>
  <page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
    <textbox id="0" bbox="191.745,592.218,249.042,603.578">
      <textline bbox="191.745,592.218,249.042,603.578">
        <new_line>
          <text font="QKWQNQ+ImprintMTnum-Bold" bbox="272.661,554.072,277.415,564.757" colourspace="DeviceGray" ncolour="0" size="10.685">1</text>
          <text font="NUMPTY+ImprintMTnum" bbox="324.480,553.628,327.384,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">sventura] </text>
          <text font="PYNIYO+ImprintMTnum-Italic" bbox="346.709,553.639,352.505,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">sps. a</text>
          <text font="NUMPTY+ImprintMTnum" bbox="368.242,553.628,372.759,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">mi-</text>
        </new_line>
      </textline>
    </textbox>
  </page>
</pages>

I'll leave it to you to adapt it to process multiple pages!
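
Note that in the output above each merged element carries the attributes (for example bbox) of the last character of its run, whereas the expected output in the question keeps those of the first character. If that matters, one possible variant is sketched below. It is not the code above: it uses the standard-library xml.etree.ElementTree with itertools.groupby instead of lxml, the `merge_runs` name and the shortened `SAMPLE` document are made up for illustration, and it assumes the text elements are direct children of new_line.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

# Shortened, hypothetical sample: one character per <text>, as in the question
SAMPLE = """<pages><page id="1"><textbox id="0"><textline><new_line>
<text font="A" bbox="1" size="10.685">1</text>
<text font="B" bbox="2" size="12.482">s</text>
<text font="B" bbox="3" size="12.482">v</text>
<text font="C" bbox="4" size="12.727">a</text>
<text font="B" bbox="5" size="12.482">m</text>
<text font="B" bbox="6" size="12.482">i</text>
</new_line></textline></textbox></page></pages>"""


def merge_runs(root):
    """Merge consecutive <text> children of each <new_line> that share a size.

    The first element of every run is kept, so its attributes survive.
    """
    for new_line in list(root.iter("new_line")):
        texts = new_line.findall("text")  # direct children only
        for size, run in groupby(texts, key=lambda t: t.get("size")):
            first, *rest = list(run)
            if size is None:
                continue  # leave elements without a size untouched
            first.text = "".join(t.text or "" for t in [first, *rest])
            for extra in rest:
                new_line.remove(extra)
    return root


root = merge_runs(ET.fromstring(SAMPLE))
result = [(t.get("bbox"), t.text) for t in root.iter("text")]
print(result)
```

groupby only groups adjacent items, so the two separate runs with size 12.482 stay separate and the character order is preserved.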


4 Comments

  • It looks great! But I get this error: new_line_block.remove(previous_text) File "src\lxml\etree.pyx", line 943, in lxml.etree._Element.remove ValueError: Element is not a child of this node.
  • Have a look at the update. I think some <text> elements are not directly below the <new_line> elements. There is an intermediate tag, like <new_line><some extra tags><text> ... </text></some extra tags></new_line>. The original solution did not account for intermediate elements.
  • Thank you! That must have been the problem, but now I get another error: line 25, in <module> previous_text.getParent().remove(previous_text) AttributeError: 'lxml.etree._Element' object has no attribute 'getParent'
  • Thank you again. I just noticed something: the letters should be joined only within <new_line> elements, while your code joins them regardless of where a <new_line> opens and closes. That is, I want to join elements only inside a <new_line> tag until it is closed, then start joining again when a new one opens. How can I solve this?
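
On the last two points, a possible sketch, under stated assumptions, of merging only inside each <new_line> while also coping with <text> elements nested under an intermediate tag. It uses the standard-library xml.etree.ElementTree, which has no getparent (the lxml method is spelled getparent, all lowercase), so a child-to-parent map is built for each block. The `<span>` wrapper, the `SAMPLE` document and the `merge_within_new_lines` name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: two <new_line> blocks, some <text> nested in a <span>
SAMPLE = """<pages><page id="1">
<new_line><span><text size="12.482">a</text><text size="12.482">b</text></span></new_line>
<new_line><span><text size="12.482">c</text></span><text size="12.482">d</text></new_line>
</page></pages>"""


def merge_within_new_lines(root):
    """Pairwise-merge same-size <text> elements, one <new_line> at a time."""
    for new_line in list(root.iter("new_line")):
        # Map every descendant of this block to its parent
        parent_of = {child: parent for parent in new_line.iter() for child in parent}
        # All <text> descendants of this block only, in document order
        texts = list(new_line.iter("text"))
        for previous, current in zip(texts[:-1], texts[1:]):
            size = current.get("size")
            if size is not None and size == previous.get("size"):
                # Carry the accumulated text forward, then drop the previous element
                current.text = (previous.text or "") + (current.text or "")
                parent_of[previous].remove(previous)
    return root


root = merge_within_new_lines(ET.fromstring(SAMPLE))
merged = [t.text for t in root.iter("text")]
print(merged)
```

Because the list of <text> elements is rebuilt for every <new_line>, "b" and "c" are never paired even though they share a size. This sketch does merge across different parents inside one block ("c" under the wrapper with "d" outside it); whether that is desirable depends on the real document.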
