I'm running on Ubuntu 18.04. A Python 2 or 3 solution would be preferred. I have XML structured like this:

<Records>
  <Record>
    <recID>123</recID>
    <tstamp>2018-12-31T23:59:42.38Z</tstamp>
  </Record>
  <Record>
    <recID>456</recID>
    <tstamp>2018-10-10T12:03:02.28Z</tstamp>
  </Record>
  <Record>
    <recID>789</recID>
    <tstamp>2018-11-11T13:50:00.00Z</tstamp>
  </Record>
</Records>

But I have a lot of it: a single 10 GB file. I'm looking for the most efficient way to sort the records by tstamp, so that the output looks like this:

<Records>
  <Record>
    <recID>456</recID>
    <tstamp>2018-10-10T12:03:02.28Z</tstamp>
  </Record>
  <Record>
    <recID>789</recID>
    <tstamp>2018-11-11T13:50:00.00Z</tstamp>
  </Record>
  <Record>
    <recID>123</recID>
    <tstamp>2018-12-31T23:59:42.38Z</tstamp>
  </Record>
</Records>

Thanks in advance.

  • What did you try? Do you want to create a new file that is sorted? Commented Nov 3, 2020 at 7:44
  • Do you have a single 10 GB XML file? Commented Nov 3, 2020 at 8:13
  • Yes, I have a single 10GB xml file. Commented Nov 3, 2020 at 12:56

1 Answer

The code below sorts the records by 'tstamp' in memory and writes the sorted result to a new file:

import datetime
import xml.etree.ElementTree as ET

xml = '''<Records>
  <Record>
    <recID>123</recID>
    <tstamp>2018-12-31T23:59:42.38Z</tstamp>
  </Record>
  <Record>
    <recID>456</recID>
    <tstamp>2018-10-10T12:03:02.28Z</tstamp>
  </Record>
  <Record>
    <recID>99</recID>
    <tstamp>1999-11-11T13:50:00.00Z</tstamp>
  </Record>
  <Record>
    <recID>88</recID>
    <tstamp>2020-11-11T13:50:00.00Z</tstamp>
  </Record>
  <Record>
    <recID>789</recID>
    <tstamp>2018-11-11T13:50:00.00Z</tstamp>
  </Record>
   <Record>
    <recID>11</recID>
    <tstamp>2012-11-11T13:50:00.00Z</tstamp>
  </Record>
</Records>'''
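root = ET.fromstring(xml)
records = root.findall('.//Record')
# Parse the full timestamp, including fractional seconds, so that records
# falling within the same second are still ordered correctly.
records = sorted(records, key=lambda r: datetime.datetime.strptime(r.find('tstamp').text, '%Y-%m-%dT%H:%M:%S.%fZ'))
for r in records:
    print(f'{r.find("tstamp").text} -- {r.find("recID").text}')
# Build a fresh root and attach the records in sorted order.
root = ET.Element('Records')
root.extend(records)

# Relative path, so the file lands in the current working directory.
ET.ElementTree(root).write('output.xml')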

Output:

1999-11-11T13:50:00.00Z -- 99
2012-11-11T13:50:00.00Z -- 11
2018-10-10T12:03:02.28Z -- 456
2018-11-11T13:50:00.00Z -- 789
2018-12-31T23:59:42.38Z -- 123
2020-11-11T13:50:00.00Z -- 88
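
Since the question is about a single 10 GB file, loading the whole tree may not fit in RAM. One possible alternative is an external merge sort: stream the records with ET.iterparse, sort them in chunks, spill each sorted chunk to a temporary file, then merge the chunks with heapq.merge. The sketch below is only an illustration under some assumptions: the function names (iter_records, write_sorted_chunks, read_chunk, sort_xml) and the chunk_size parameter are made up for this example, the root element has no namespace or attributes, and every tstamp uses the same fixed-width UTC format so that plain string comparison matches chronological order.

```python
import heapq
import json
import os
import tempfile
import xml.etree.ElementTree as ET


def iter_records(path, tag='Record'):
    """Yield (tstamp, serialized record) pairs without building the full tree."""
    context = ET.iterparse(path, events=('start', 'end'))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            elem.tail = None  # drop whitespace that follows the record
            yield elem.find('tstamp').text, ET.tostring(elem, encoding='unicode')
            root.clear()  # release records that have already been processed


def write_sorted_chunks(records, chunk_size):
    """Sort chunk_size records at a time and spill each sorted run to a temp file."""
    paths, chunk = [], []

    def flush():
        chunk.sort()
        fd, path = tempfile.mkstemp(suffix='.jsonl')
        with os.fdopen(fd, 'w', encoding='utf-8') as f:
            for item in chunk:
                # JSON escapes newlines, so each record takes exactly one line.
                f.write(json.dumps(item) + '\n')
        paths.append(path)
        chunk.clear()

    for rec in records:
        chunk.append(rec)
        if len(chunk) >= chunk_size:
            flush()
    if chunk:
        flush()
    return paths


def read_chunk(path):
    """Lazily read back one sorted run."""
    with open(path, encoding='utf-8') as f:
        for line in f:
            yield tuple(json.loads(line))


def sort_xml(src, dst, chunk_size=100000):
    """Write the records of src to dst, ordered by tstamp (assumed string-sortable)."""
    paths = write_sorted_chunks(iter_records(src), chunk_size)
    try:
        with open(dst, 'w', encoding='utf-8') as out:
            out.write('<Records>\n')
            # heapq.merge keeps only one pending record per chunk in memory.
            for _, xml_text in heapq.merge(*(read_chunk(p) for p in paths)):
                out.write('  ' + xml_text + '\n')
            out.write('</Records>\n')
    finally:
        for p in paths:
            os.remove(p)
```

It could be called as sort_xml('input.xml', 'sorted.xml'), where both file names are placeholders. All chunk files are open at once during the merge, so chunk_size should be large enough to keep the number of chunks below the process's open-file limit while still fitting one chunk in memory. I have not run this on a 10 GB file.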

6 Comments

Hmmm...printing on each records iteration like you have does print sorted...but when I write to file like this ET.ElementTree(root).write("sorted.xml"), the xml in the output file is not sorted. Why doesn't writing to file preserve the sort?
@SamerA. code is updated and the data is written to a file (sorted)
It would be interesting to see whether this is feasible on a 10 GB file. I imagine the line root.findall('.//Record') would be strenuous, and maybe root.extend(records) as well.
I agree - the code needs lots of RAM (given the file size..)
Nice! Just another followup observation; if the root node <Records> had namespace and attributes like this <Records msgVersion="RevB" xmlns="http://www.w3.org/foobar">, something breaks. How can I tell ElementTree a namespace to use? I have not tried the code on a 10GB file yet, crossing my fingers...
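
On the namespace question in the last comment: with a default namespace on the root, every child tag is namespace-qualified, so unprefixed lookups such as find('tstamp') return None. A minimal sketch of one way to handle it, assuming the namespace URI from the comment: pass a prefix map to find/findall, call ET.register_namespace so the output keeps the default namespace, and reorder the children of the existing root in place so its attributes survive. The 'r' prefix and the NS_URI and NS names are arbitrary choices for this example, and the timestamps are again compared as strings on the assumption that they share one fixed-width format.

```python
import xml.etree.ElementTree as ET

NS_URI = 'http://www.w3.org/foobar'
NS = {'r': NS_URI}  # 'r' is an arbitrary prefix, used only in the lookups below

xml = '''<Records msgVersion="RevB" xmlns="http://www.w3.org/foobar">
  <Record><recID>123</recID><tstamp>2018-12-31T23:59:42.38Z</tstamp></Record>
  <Record><recID>456</recID><tstamp>2018-10-10T12:03:02.28Z</tstamp></Record>
</Records>'''

# Serialize this URI as the default namespace instead of an ns0: prefix.
ET.register_namespace('', NS_URI)

root = ET.fromstring(xml)
# Replace the children of the existing root in sorted order; the root element
# itself, and therefore its msgVersion attribute, is left untouched.
root[:] = sorted(root.findall('r:Record', NS),
                 key=lambda rec: rec.find('r:tstamp', NS).text)

result = ET.tostring(root, encoding='unicode')
print(result)
```

Reusing the original root rather than creating a new ET.Element('Records') is what preserves both the namespace and the attributes.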