Using Python and ElementTree, I'm having trouble parsing XML into text line items such that each line item represents exactly one level. Each line item will eventually be one record in a database, so that the user can search on multiple terms within that field. Sample XML:

<?xml version="1.0" encoding="utf-8"?>
 <root>
    <mainTerm>
      <title>Meat</title>
      <see>protein</see>
    </mainTerm>
    <mainTerm>
      <title>Vegetables</title>
      <see>starch</see>
    </mainTerm>
    <mainTerm>
      <title>Fruit</nemod></title>
      <term level="1">
        <title>Apple</title>
        <code>apl</code>
      </term>
      <term level="1">
        <title>Red Delicious</title>
        <code>rd</code>
        <term level="2">
          <title>Large Red Delicious</title>
          <code>lrd</code>
        </term>
        <term level="2">
          <title>Medium Red Delicious</title>
          <code>mrd</code>
        </term>
        <term level="2">
          <title>Small Red Delicious</title>
          <code>mrd</code>
        </term>        
      <term level="1">
        <title>Grapes</title>
        <code>grp</code>
      </term>
      <term level="1">
        <title>Peaches</title>
        <code>pch</code>
      </term>      
    </mainTerm>
</root>

Desired Output:

Meat > protein
Vegetables > starch
Fruit > Apple > apl
Fruit > Apple > apl > Red Delicious > rd
Fruit > Apple > apl > Red Delicious > rd > Large Red Delicious > lrd
Fruit > Apple > apl > Red Delicious > rd > Medium Red Delicious > mrd
Fruit > Apple > apl > Red Delicious > rd > Small Red Delicious > srd
Fruit > Grapes > grp
Fruit > Peaches > pch

It's easy enough to use the tag 'mainTerm' to parse the XML; the tricky part is limiting each line to a single level while still including the upper-level terms in the text. I'm basically trying to "flatten" the XML hierarchy by creating unique lines of text, each of which lists its parents (e.g. Fruit > Apple > apl) but not its siblings (e.g. Large Red Delicious, Medium Red Delicious, or Small Red Delicious).

I realize this could be accomplished by first converting the data to a relational database format and then running a query, but I was hoping for a more direct solution that works straight from the XML.

Hope this makes sense...thanks

  • The XML you've provided is not well-formed: note the stray </nemod> tag and the missing closing </term> tag. Commented Apr 8, 2014 at 2:23

1 Answer

There is a nice tool called xmltodict that builds a hierarchical data structure straight from the XML:

import json
import xmltodict


data = """your xml goes here"""

result = xmltodict.parse(data)
print(json.dumps(result, indent=4))

For the XML you've provided (with a few alterations to make it well-formed; see my comment), it prints:

{
    "root": {
        "mainTerm": [
            {
                "title": "Meat", 
                "see": "protein"
            }, 
            {
                "title": "Vegetables", 
                "see": "starch"
            }, 
            {
                "title": "Fruit", 
                "term": [
                    {
                        "@level": "1", 
                        "title": "Apple", 
                        "code": "apl"
                    }, 
                    {
                        "@level": "1", 
                        "title": "Red Delicious", 
                        "code": "rd", 
                        "term": [
                            {
                                "@level": "2", 
                                "title": "Large Red Delicious", 
                                "code": "lrd"
                            }, 
                            {
                                "@level": "2", 
                                "title": "Medium Red Delicious", 
                                "code": "mrd"
                            }, 
                            {
                                "@level": "2", 
                                "title": "Small Red Delicious", 
                                "code": "mrd"
                            }
                        ]
                    }, 
                    {
                        "@level": "1", 
                        "title": "Grapes", 
                        "code": "grp"
                    }, 
                    {
                        "@level": "1", 
                        "title": "Peaches", 
                        "code": "pch"
                    }
                ]
            }
        ]
    }
}
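
From a structure like this, the one-line-per-level text can be produced with a short recursive walk. The following is only a sketch under a few assumptions: the names as_list and flatten are made up for illustration, the dict is a hand-written, shortened copy of the output above (so the snippet runs without xmltodict installed), and a mainTerm that has nested terms is treated as a prefix only, not as a line of its own.

```python
def as_list(value):
    """Normalize a child value: a lone child is a dict, repeated ones a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def flatten(node, parents=()):
    """Yield one ' > '-joined line per node, prefixed by its ancestors' text."""
    parts = list(parents)
    # Collect this node's own text fields in a fixed order, skipping absent ones.
    for key in ("title", "see", "code"):
        if key in node:
            parts.append(node[key])
    children = as_list(node.get("term"))
    # A top-level mainTerm with nested terms (e.g. Fruit) is only a prefix;
    # every other node gets its own line.
    if parents or not children:
        yield " > ".join(parts)
    for child in children:
        yield from flatten(child, parts)


# Shortened, hand-written copy of the parsed structure shown above.
result = {
    "root": {
        "mainTerm": [
            {"title": "Meat", "see": "protein"},
            {
                "title": "Fruit",
                "term": [
                    {"@level": "1", "title": "Apple", "code": "apl"},
                    {
                        "@level": "1",
                        "title": "Red Delicious",
                        "code": "rd",
                        "term": [
                            {"@level": "2",
                             "title": "Large Red Delicious",
                             "code": "lrd"},
                        ],
                    },
                ],
            },
        ]
    }
}

lines = [line
         for main in as_list(result["root"]["mainTerm"])
         for line in flatten(main)]
print("\n".join(lines))
```

Note that in the sample as posted, Red Delicious sits next to Apple rather than inside it, so this walk gives "Fruit > Red Delicious > rd" and not "Fruit > Apple > apl > Red Delicious > rd"; the latter would need the Red Delicious term nested under Apple in the XML.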

2 Comments

Thanks for the reply, but what I was after was a single line for each level (with the text from higher levels separated by ">" symbols). Alternatively, I guess if I converted the XML to JSON, as you imply, I could search the data via JavaScript / JSON instead of PHP / SQL, though I suspect much less efficiently.
@user1526973 yeah, this is actually my point here. Convert it to something that you can more easily work with. Hope it helps.
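
For the single-line-per-level output requested here, a recursive walk with the standard library's ElementTree is one possible direct route. This is a rough sketch, not a definitive solution: the function name walk is hypothetical, the XML is a shortened, well-formed variant of the sample, and it assumes each element's own text lives in direct title, see, and code children.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed variant of the sample XML (an assumption for the demo).
XML = """<root>
  <mainTerm>
    <title>Meat</title>
    <see>protein</see>
  </mainTerm>
  <mainTerm>
    <title>Fruit</title>
    <term level="1">
      <title>Apple</title>
      <code>apl</code>
    </term>
    <term level="1">
      <title>Red Delicious</title>
      <code>rd</code>
      <term level="2">
        <title>Large Red Delicious</title>
        <code>lrd</code>
      </term>
    </term>
  </mainTerm>
</root>"""


def walk(elem, parents=()):
    """Yield one line per element, carrying the ancestors' text as a prefix."""
    parts = list(parents)
    for tag in ("title", "see", "code"):
        text = elem.findtext(tag)  # looks at direct children only; None if absent
        if text:
            parts.append(text.strip())
    children = elem.findall("term")  # direct child <term> elements only
    # A mainTerm that has nested terms acts purely as a prefix (no own line).
    if parents or not children:
        yield " > ".join(parts)
    for child in children:
        yield from walk(child, parts)


root = ET.fromstring(XML)
et_lines = [line for main in root.findall("mainTerm") for line in walk(main)]
print("\n".join(et_lines))
```

Each yielded string lists a node's parents but none of its siblings, so it could be stored as one database record. Because Red Delicious is a sibling of Apple in the sample, it appears as "Fruit > Red Delicious > rd"; nesting it inside Apple's term element would put Apple in its prefix.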
