I'm trying to sort my XML alphabetically while ensuring that a specific element stays at the top. I have managed to sort it alphabetically, but I cannot get that element to stay. Here is what I have so far:

from lxml import etree

data = """
<Example xmlns="http://www.example.org">
    <E>
        <A>A</A>
        <B>B</B>
        <C>C</C>
    </E>
    <B>B</B>
    <D>D</D>
    <A>A</A>
    <C>C</C>
    <F>F</F>
</Example>
"""
doc = etree.XML(data,etree.XMLParser(remove_blank_text=True))

for parent in doc.xpath('//*[./*]'):
    parent[:] = sorted(parent,key=lambda x: x.tag)

print(etree.tostring(doc, pretty_print=True).decode())

The result from this is:

<Example xmlns="http://www.example.org">
  <A>A</A>
  <B>B</B>
  <C>C</C>
  <D>D</D>
  <E>
    <A>A</A>
    <B>B</B>
    <C>C</C>
  </E>
  <F>F</F>
</Example>

Is there any way I can stop <E> and its contents from moving?

  • What is it about <E> that makes it an element which should not be sorted? Is it because it has child nodes? Commented Sep 6, 2017 at 16:40
  • @James Nope, the child nodes do not matter. I want to make the XML conform to a given schema, which requires that <E> stays at the top, but I wish to sort the rest alphabetically. Commented Sep 6, 2017 at 16:51

2 Answers

You can handle this in at least two ways. One is to sort everything with a custom key that forces <E> to the top. The other is to split out the elements to be sorted, sort them, and append them after the elements that must stay put.

Custom sort:

Strings sort by code point, compared one character at a time; ord() returns the code point of a single character. Among the printable characters, the tab has the lowest code point (9), well below the "{" (123) that starts every namespaced tag. So we can tell Python to sort all elements by tag as usual, except <E>, whose key is a tab and therefore sorts first.

There is some extra code to handle the namespace.

doc = etree.XML(data,etree.XMLParser(remove_blank_text=True))
ns = doc.nsmap

for parent in doc.xpath('//*[./*]'):
    parent[:] = sorted(parent,key=lambda x: x.tag if x.tag!='{'+ns[None]+'}E' else '\t')

print(etree.tostring(doc,pretty_print=True).decode('ascii'))

<Example xmlns="http://www.example.org">
  <E>
    <A>A</A>
    <B>B</B>
    <C>C</C>
  </E>
  <A>A</A>
  <B>B</B>
  <C>C</C>
  <D>D</D>
  <F>F</F>
</Example>
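
To see why the tab key lands first, here is a small sketch that works on plain tag strings rather than parsed elements. The tag list is made up for illustration; only the namespace URI comes from the question's sample data.

```python
# Tags in the "{namespace}name" form that lxml reports for namespaced elements.
NS = "{http://www.example.org}"
tags = [NS + "B", NS + "E", NS + "A"]

# Tab is code point 9 and "{" is code point 123, so a "\t" key
# compares lower than any namespaced tag.
print(ord("\t"), ord("{"))

# Same idea as the lambda above: use "\t" as the key for E, the tag otherwise.
ordered = sorted(tags, key=lambda t: "\t" if t == NS + "E" else t)
print(ordered)
```

If several elements share the pinned tag they all get the same key, and because sorted() is stable they keep their original relative order.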

Split, apply, combine

Here we split the parent into two lists, sort the second list, and then merge them.

doc = etree.XML(data,etree.XMLParser(remove_blank_text=True))
ns = doc.nsmap
for parent in doc.xpath('//*[./*]'):
    to_sort = (e for e in parent if e.tag!='{'+ns[None]+'}E')
    non_sort = (e for e in parent if e.tag=='{'+ns[None]+'}E')
    parent[:] = list(non_sort) + sorted(to_sort, key=lambda e: e.tag)
print(etree.tostring(doc,pretty_print=True).decode('ascii'))

<Example xmlns="http://www.example.org">
  <E>
    <A>A</A>
    <B>B</B>
    <C>C</C>
  </E>
  <A>A</A>
  <B>B</B>
  <C>C</C>
  <D>D</D>
  <F>F</F>
</Example>
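
Note that the XPath '//*[./*]' also selects <E> itself as a parent, so both approaches sort the children of <E> too. If those children should keep their original order, one option is to skip <E> when walking the parents. The sketch below is an assumption-laden variant, not the code above: it uses the standard library's xml.etree.ElementTree so it runs without extra installs (the calls used here, XML(), iter(), len() and slice assignment, are also available on lxml elements), it walks the tree with iter() instead of XPath, and the sample data is shortened and reordered so the effect is visible.

```python
import xml.etree.ElementTree as etree  # stand-in for lxml's etree in this sketch

data = """
<Example xmlns="http://www.example.org">
    <E>
        <C>C</C>
        <A>A</A>
        <B>B</B>
    </E>
    <B>B</B>
    <D>D</D>
    <A>A</A>
</Example>
"""
PINNED = "{http://www.example.org}E"  # the tag that must stay first

doc = etree.XML(data)

# Snapshot the elements first so reordering does not disturb the walk.
for parent in list(doc.iter()):
    if len(parent) == 0 or parent.tag == PINNED:
        continue  # leaves have nothing to sort; <E> keeps its own child order
    pinned = [e for e in parent if e.tag == PINNED]
    rest = sorted((e for e in parent if e.tag != PINNED), key=lambda e: e.tag)
    parent[:] = pinned + rest

top_level = [e.tag.split("}")[1] for e in doc]
inside_e = [e.tag.split("}")[1] for e in doc[0]]
print(top_level, inside_e)
```

Only <E> itself is skipped here; any deeper descendants of <E> that have children of their own would still be sorted.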

1 Comment

Fantastic. Thanks for both of the methods! I like the second one. When I try the second method, it also sorts the child nodes inside the non_sort list. Should it sort that list? I thought it wouldn't as that was not included in the sorted() function. I forgot to include it in the question, but I'm actually not looking to sort the child nodes inside <E>, so that'd be ideal.

The following also works. The plain tag name is not directly available on a namespaced element, so the key compares against the full tag, including the namespace (xmlns) part:

doc = etree.XML(data, etree.XMLParser(remove_blank_text=True))

for parent in doc.xpath('//*[./*]'):
    # (False, tag) sorts before (True, tag), so E comes first
    parent[:] = sorted(parent,
                       key=lambda x: (x.tag != '{http://www.example.org}E', x.tag))

print(etree.tostring(doc, pretty_print=True, encoding='unicode'))

This code outputs:

<Example xmlns="http://www.example.org">
  <E>
    <A>A</A>
    <B>B</B>
    <C>C</C>
  </E>
  <A>A</A>
  <B>B</B>
  <C>C</C>
  <D>D</D>
  <F>F</F>
</Example>
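
The key here is a tuple, and Python compares tuples item by item. A minimal sketch of that ordering on plain tag strings (the helper name pin_first is made up for this illustration):

```python
NS = "{http://www.example.org}"
tags = [NS + name for name in ["E", "B", "D", "A", "C", "F"]]

def pin_first(tag, pinned=NS + "E"):
    # False sorts before True, so the pinned tag leads;
    # the second item keeps everything else alphabetical.
    return (tag != pinned, tag)

ordered = sorted(tags, key=pin_first)
names = [t.split("}")[1] for t in ordered]
print(names)
```

Compared with the tab trick, this does not rely on any particular character sorting lower than the tags.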

The following code just prints these long tags to show what they look like:

doc = etree.XML(data, etree.XMLParser(remove_blank_text=True))

for parent in doc.xpath('//*[./*]'):
    for item in parent:
        print(item.tag)

{http://www.example.org}E
{http://www.example.org}B
{http://www.example.org}D
{http://www.example.org}A
{http://www.example.org}C
{http://www.example.org}F
{http://www.example.org}A
{http://www.example.org}B
{http://www.example.org}C

Another way is to use a helper function that strips the namespace from the tag, which makes the sort key more readable:

def normalize(name):
    if name[0] == "{":
        uri, tag = name[1:].split("}")
        return tag
    else:
        return name

doc = etree.XML(data, etree.XMLParser(remove_blank_text=True))

for parent in doc.xpath('//*[./*]'):
    parent[:] = sorted(parent,
                       key=lambda x: (not normalize(x.tag) == 'E', x.tag))
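
A shorter way to write such a helper, sketched on plain tag strings, is str.rpartition. The name local_name is made up for this example. With lxml, etree.QName(element).localname should give the same result without any string handling.

```python
def local_name(tag):
    # "{uri}E" -> "E"; a tag without a namespace is returned unchanged.
    return tag.rpartition("}")[2]

tags = ["{http://www.example.org}B", "{http://www.example.org}E", "A"]

# Pin E first, then order the rest by local name.
ordered = sorted(tags, key=lambda t: (local_name(t) != "E", local_name(t)))
print(ordered)
```

Sorting on the local name in the second tuple item, rather than on the full tag, keeps the order alphabetical even when elements come from different namespaces.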

1 Comment

Fantastic, thank you! Is there any way to custom-sort the ordering inside <E>? I forgot to mention that the child nodes inside it must follow a specific order rather than an alphabetical one.
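
One possible approach to a fixed order inside <E> is to rank the children against an explicit list. This is only a sketch: SCHEMA_ORDER is a hypothetical order that would have to be replaced with the one the schema requires, and the standard library's xml.etree.ElementTree stands in for lxml here (slice assignment on elements works the same way in both).

```python
import xml.etree.ElementTree as etree  # stand-in for lxml's etree in this sketch

data = '<E xmlns="http://www.example.org"><A>A</A><B>B</B><C>C</C></E>'

# Hypothetical order of <E>'s children; replace with the schema's real order.
SCHEMA_ORDER = ["C", "A", "B"]
rank = {name: i for i, name in enumerate(SCHEMA_ORDER)}

def schema_key(child):
    name = child.tag.rpartition("}")[2]  # strip the "{namespace}" prefix
    # Names missing from SCHEMA_ORDER go after the listed ones, alphabetically.
    return (rank.get(name, len(rank)), name)

e = etree.XML(data)
e[:] = sorted(e, key=schema_key)

result = [c.tag.rpartition("}")[2] for c in e]
print(result)
```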
