How to sort xml by node value in python

Question

I'm running on Ubuntu 18.04. A Python 2 or 3 solution would be preferred. I've got xml structured like so:

<Records>
  <Record>
    <recID>123</recID>
    <tstamp>2018-12-31T23:59:42.38Z</tstamp>
  </Record>
  <Record>
    <recID>456</recID>
    <tstamp>2018-10-10T12:03:02.28Z</tstamp>
  </Record>
  <Record>
    <recID>789</recID>
    <tstamp>2018-11-11T13:50:00.00Z</tstamp>
  </Record>
</Records>

But I've got a lot of it, a single 10GB file worth. I'm looking for the most efficient way to sort the records on tstamp, such that the output would look like this:

<Records>
  <Record>
    <recID>456</recID>
    <tstamp>2018-10-10T12:03:02.28Z</tstamp>
  </Record>
  <Record>
    <recID>789</recID>
    <tstamp>2018-11-11T13:50:00.00Z</tstamp>
  </Record>
  <Record>
    <recID>123</recID>
    <tstamp>2018-12-31T23:59:42.38Z</tstamp>
  </Record>
</Records>

Thanks in advance.

What did you try? Do you want to create a new file that is sorted?
– balderman, Commented Nov 3, 2020 at 7:44

Parfait · Accepted Answer · 2020-11-03 14:52:37Z

Below is code that sorts the records by 'tstamp':

import datetime
import xml.etree.ElementTree as ET

xml = '''<Records>
  <Record>
    <recID>123</recID>
    <tstamp>2018-12-31T23:59:42.38Z</tstamp>
  </Record>
  <Record>
    <recID>456</recID>
    <tstamp>2018-10-10T12:03:02.28Z</tstamp>
  </Record>
  <Record>
    <recID>99</recID>
    <tstamp>1999-11-11T13:50:00.00Z</tstamp>
  </Record>
  <Record>
    <recID>88</recID>
    <tstamp>2020-11-11T13:50:00.00Z</tstamp>
  </Record>
  <Record>
    <recID>789</recID>
    <tstamp>2018-11-11T13:50:00.00Z</tstamp>
  </Record>
  <Record>
    <recID>11</recID>
    <tstamp>2012-11-11T13:50:00.00Z</tstamp>
  </Record>
</Records>'''
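# Parse the document and collect every <Record> element
root = ET.fromstring(xml)
records = root.findall('.//Record')
# Sort by timestamp; the [:19] slice drops the fractional seconds and the trailing 'Z',
# so records that fall within the same second keep their input order
records = sorted(records, key=lambda r: datetime.datetime.strptime(r.find('tstamp').text[:19], '%Y-%m-%dT%H:%M:%S'))
for r in records:
    print(f'{r.find("tstamp").text} -- {r.find("recID").text}')
# Build a fresh root from the sorted records; the original root still has the old order
root = ET.Element('Records')
root.extend(records)

# Write the sorted tree to a file
ET.ElementTree(root).write('c:\\temp\\output.xml')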

Output

1999-11-11T13:50:00.00Z -- 99
2012-11-11T13:50:00.00Z -- 11
2018-10-10T12:03:02.28Z -- 456
2018-11-11T13:50:00.00Z -- 789
2018-12-31T23:59:42.38Z -- 123
2020-11-11T13:50:00.00Z -- 88

edited Nov 3, 2020 at 14:52

Parfait

answered Nov 3, 2020 at 8:08

balderman

6 Comments

Samer A. Over a year ago

Hmm... printing inside the loop over records does show them sorted, but when I write to file like this ET.ElementTree(root).write("sorted.xml"), the XML in the output file is not sorted. Why doesn't writing to file preserve the sort?

balderman Over a year ago

@SamerA. code is updated and the data is written to a file (sorted)
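
The behaviour described in the question above follows from sorted() returning a new list while the children of root stay in their original order, so writing root emits the unsorted tree. Here is a minimal sketch of one way around it, assuming the Record/tstamp layout from the question and no namespace: assign the sorted list back into the root with slice assignment. The sample data and the sort_records_in_place helper name are illustrative only and are not part of the original answer.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<Records>
  <Record><recID>123</recID><tstamp>2018-12-31T23:59:42.38Z</tstamp></Record>
  <Record><recID>456</recID><tstamp>2018-10-10T12:03:02.28Z</tstamp></Record>
  <Record><recID>789</recID><tstamp>2018-11-11T13:50:00.00Z</tstamp></Record>
</Records>"""


def sort_records_in_place(root):
    """Hypothetical helper: reorder the direct children of root by <tstamp>."""
    # Assumption: every timestamp uses the same fixed-width UTC ISO 8601 form,
    # so plain string comparison matches chronological order.
    root[:] = sorted(root, key=lambda rec: rec.findtext("tstamp"))
    return root


root = sort_records_in_place(ET.fromstring(SAMPLE))
sorted_ids = [rec.findtext("recID") for rec in root]
print(sorted_ids)  # ['456', '789', '123']

# The reordered tree can then be written out as usual, for example:
# ET.ElementTree(root).write("sorted.xml")
```

Because the same root element is reused, any attributes on <Records> are kept, which building a fresh ET.Element('Records') would not do.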

Parfait Over a year ago

Interesting to see if this is feasible on a 10 GB file. I imagine the line root.findall('.//Record') would be strenuous, and maybe root.extend(records) as well.

balderman Over a year ago

I agree - the code needs lots of RAM (given the file size).
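
For the memory concern raised here, one possible direction is an external merge sort: stream the records with ET.iterparse, spill sorted batches to temporary files, then combine them with heapq.merge. The sketch below is not the answer's method and has not been tried on a 10 GB file. It assumes no namespace on the elements, that every Record has a tstamp, and that all timestamps share the fixed-width UTC form from the question, so they compare correctly as strings. The function names and chunk_size value are illustrative.

```python
import heapq
import os
import pickle
import tempfile
import xml.etree.ElementTree as ET


def iter_records(path):
    """Yield (tstamp, serialized <Record>) pairs without loading the whole file."""
    context = ET.iterparse(path, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "Record":
            elem.tail = None  # drop whitespace that followed the element
            yield elem.findtext("tstamp"), ET.tostring(elem, encoding="unicode")
            root.clear()  # release records that were already handled


def write_chunk(pairs):
    """Sort one in-memory batch and spill it to a temporary file."""
    pairs.sort()
    fd, name = tempfile.mkstemp(suffix=".chunk")
    with os.fdopen(fd, "wb") as fh:
        for pair in pairs:
            pickle.dump(pair, fh)
    return name


def read_chunk(name):
    """Lazily read back the pairs of one sorted chunk file."""
    with open(name, "rb") as fh:
        while True:
            try:
                yield pickle.load(fh)
            except EOFError:
                return


def external_sort(src, dst, chunk_size=100_000):
    """Write the records of src to dst in tstamp order, chunk_size at a time."""
    chunk_names, batch = [], []
    for pair in iter_records(src):
        batch.append(pair)
        if len(batch) >= chunk_size:
            chunk_names.append(write_chunk(batch))
            batch = []
    if batch:
        chunk_names.append(write_chunk(batch))
    try:
        with open(dst, "w", encoding="utf-8") as out:
            out.write("<Records>\n")
            # heapq.merge keeps only one pending pair per chunk in memory
            for _, record_xml in heapq.merge(*map(read_chunk, chunk_names)):
                out.write("  " + record_xml + "\n")
            out.write("</Records>\n")
    finally:
        for name in chunk_names:
            os.remove(name)


# Small demonstration with the data from the question
sample = """<Records>
  <Record>
    <recID>123</recID>
    <tstamp>2018-12-31T23:59:42.38Z</tstamp>
  </Record>
  <Record>
    <recID>456</recID>
    <tstamp>2018-10-10T12:03:02.28Z</tstamp>
  </Record>
  <Record>
    <recID>789</recID>
    <tstamp>2018-11-11T13:50:00.00Z</tstamp>
  </Record>
</Records>"""
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "records.xml")
dst = os.path.join(workdir, "sorted.xml")
with open(src, "w", encoding="utf-8") as fh:
    fh.write(sample)
external_sort(src, dst, chunk_size=2)  # tiny chunks to force a real merge
merged_ids = [rec.findtext("recID") for rec in ET.parse(dst).getroot()]
print(merged_ids)  # ['456', '789', '123']
```

Peak memory is then governed by chunk_size rather than by the file size, at the cost of roughly one extra copy of the data on disk while the chunks exist.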

Samer A. Over a year ago

Nice! Just another followup observation; if the root node <Records> had namespace and attributes like this <Records msgVersion="RevB" xmlns="http://www.w3.org/foobar">, something breaks. How can I tell ElementTree a namespace to use? I have not tried the code on a 10GB file yet, crossing my fingers...
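
On the namespace question: with a default namespace on <Records>, ElementTree stores every tag as {uri}Record, so un-prefixed lookups such as find('tstamp') return None and the .text access fails. Here is a hedged sketch of one way to handle it, using the namespaces argument of findall/findtext for the queries and ET.register_namespace so the output keeps a default xmlns instead of an ns0: prefix. The prefix "r" and the variable names are arbitrary choices for illustration, and string comparison of the timestamps again assumes the fixed-width UTC form from the question.

```python
import xml.etree.ElementTree as ET

NS_URI = "http://www.w3.org/foobar"  # namespace URI taken from the comment above
NS = {"r": NS_URI}  # "r" is an arbitrary prefix used only inside the queries

SAMPLE = """<Records msgVersion="RevB" xmlns="http://www.w3.org/foobar">
  <Record><recID>123</recID><tstamp>2018-12-31T23:59:42.38Z</tstamp></Record>
  <Record><recID>456</recID><tstamp>2018-10-10T12:03:02.28Z</tstamp></Record>
</Records>"""

# Serialize this URI as the default namespace again rather than as ns0:
ET.register_namespace("", NS_URI)

root = ET.fromstring(SAMPLE)
# Reorder the children in place so the root keeps its attributes
root[:] = sorted(
    root.findall("r:Record", NS),
    key=lambda rec: rec.findtext("r:tstamp", namespaces=NS),
)

ns_sorted_ids = [rec.findtext("r:recID", namespaces=NS) for rec in root]
print(ns_sorted_ids)  # ['456', '123']

serialized = ET.tostring(root, encoding="unicode")
```

Sorting in place, rather than creating a new ET.Element('Records'), is what preserves the msgVersion attribute here.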
