I have an XML file like this:

<?xml version="1.0"?>
<PropertySet>
    <PropertySet NumOutputObjects="1" >
        <Message IntObjectName="Class Def" MessageType="Integration Object">
            <ListOf_Class_Def>
                <ImpExp Type="CLASS_DEF" Name="lp_pkg_cla" Object_Num="1001p">
                    <ListOfObject_Def>
                        <Object_Def Ancestor_Num="" Ancestor_Name="">
                        </Object_Def>
                    </ListOfObject_Def>
                    <ListOfObject_Arrt>
                        <Object_Arrt Orig_Id="6666p" Attr_Name="LP_Portable">
                        </Object_Arrt>
                    </ListOfObject_Arrt>
                </ImpExp>
            </ListOf_Class_Def>
        </Message>
    </PropertySet>
    <PropertySet NumOutputObjects="1" >
        <Message IntObjectName="Class Def" MessageType="Integration Object">
            <ListOf_Class_Def>
                <ImpExp Type="CLASS_DEF" Name="M_pkg_cla" Object_Num="1023i">
                    <ListOfObject_Def>
                        <Object_Def Ancestor_Num="" Ancestor_Name="">
                        </Object_Def>
                    </ListOfObject_Def>
                    <ListOfObject_Arrt>
                        <Object_Arrt Orig_Id="7010p" Attr_Name="O_Portable">
                        </Object_Arrt>
                        <Object_Arrt Orig_Id="7012j" Attr_Name="O_wireless">
                        </Object_Arrt>
                    </ListOfObject_Arrt>
                </ImpExp>
            </ListOf_Class_Def>
        </Message>
    </PropertySet>
    <PropertySet NumOutputObjects="1" >
        <Message IntObjectName="Prod Def" MessageType="Integration Object">
            <ListOf_Prod_Def>
                <ImpExp Type="PROD_DEF" Name="Laptop" Object_Num="2008a">
                    <ListOfObject_Def>
                        <Object_Def Ancestor_Num="1001p" Ancestor_Name="lp_pkg_cla">
                        </Object_Def>
                    </ListOfObject_Def>
                    <ListOfObject_Arrt>
                    </ListOfObject_Arrt>
                </ImpExp>
            </ListOf_Prod_Def>
        </Message>
    </PropertySet>
    <PropertySet NumOutputObjects="1" >
        <Message IntObjectName="Prod Def" MessageType="Integration Object">
            <ListOf_Prod_Def>
                <ImpExp Type="PROD_DEF" Name="Mouse" Object_Num="2987d">
                    <ListOfObject_Def>
                        <Object_Def Ancestor_Num="1023i" Ancestor_Name="M_pkg_cla">
                        </Object_Def>
                    </ListOfObject_Def>
                    <ListOfObject_Arrt>
                    </ListOfObject_Arrt>
                </ImpExp>
            </ListOf_Prod_Def>
        </Message>
    </PropertySet>
    <PropertySet NumOutputObjects="1" >
        <Message IntObjectName="Prod Def" MessageType="Integration Object">
            <ListOf_Prod_Def>
                <ImpExp Type="PROD_DEF" Name="Speaker" Object_Num="5463g">
                    <ListOfObject_Def>
                        <Object_Def Ancestor_Num="" Ancestor_Name="">
                        </Object_Def>
                    </ListOfObject_Def>
                    <ListOfObject_Arrt>
                    </ListOfObject_Arrt>
                </ImpExp>
            </ListOf_Prod_Def>
        </Message>
    </PropertySet>
</PropertySet>

I want to extract the Name, Object_Num, Orig_Id and Attr_Name attributes from it using Python and write them to a .csv file.

The .csv format I'd like to see it in is simply:

ProductId   Product   AttributeId   Attribute
2008a       Laptop    6666p         LP_Portable
2987d       Mouse     7010p         O_Portable
2987d       Mouse     7012j         O_wireless
5463g       Speaker   ""            ""

The XML elements are related as follows:

  1. All products are in elements of the form <ImpExp Type="PROD_DEF" ...>.
  2. All attributes are in elements of the form <ImpExp Type="CLASS_DEF" ...>.
  3. If a product has attributes, it contains an element such as <Object_Def Ancestor_Num="1023i" ...>.
  4. That Ancestor_Num is equal to the Object_Num of the matching <ImpExp Type="CLASS_DEF" ...> element.

I have tried this:

from lxml import etree
import pandas
import HTMLParser 

inFile = "./newm.xml"
outFile = "./new.csv"

ctx1 = etree.iterparse(inFile, tag=("ImpExp", "ListOfObject_Def", "ListOfObject_Arrt",))


hp = HTMLParser.HTMLParser()
csvData = []
csvData1 = []
csvData2 = []
csvData3 = []
csvData4 = []
csvData5 = []

for event, elem in ctx1:
    value1 = elem.get("Type")
    value2 = elem.get("Name")
    value3 = elem.get("Object_Num")
    value4 = elem.get("Ancestor_Num")
    value5 = elem.get("Orig_Id")
    value6 = elem.get("Attr_Name")
    if value1 == "PROD_DEF":
        csvData.append(value2)
        csvData1.append(value3)
        for event, elem in ctx1:
            if value4 is not None:
                csvData2.append(value4)
                elem.clear()

df = pandas.DataFrame({'Product':csvData, 'ProductId':csvData1, 'AncestorId':csvData2})

for event, elem in ctx1: 
    if value1 == "Class Def":
        csvData3.append(value3)
        csvData4.append(value5)
        csvData5.append(value6)
        elem.clear()

df1 = pandas.DataFrame({'AncestorId':csvData3, 'AttribId':csvData4, 'AttribName':csvData5})

dff = pandas.merge(df, df1, on="AncestorId")
dff.to_csv(outFile, index = False)
  • You've shown what you have tried, but you need to edit the question to include what and where the problems are that you are having. Commented Feb 23, 2018 at 7:45
  • The problem is that this code does not produce that table as output. Please help me. Thank you. Commented Feb 26, 2018 at 3:53

2 Answers

Consider XSLT, the special-purpose language designed to transform XML files. It can convert XML directly to CSV (i.e., a text file) without a pandas DataFrame as an intermediary. Python's third-party module lxml (which you are already using) can run XSLT 1.0 scripts, so the Python side needs no for loops or if logic. However, because products and attributes have to be matched across separate elements, the XSLT uses some longer XPath expressions.

XSLT (save as .xsl file, a special .xml file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="no" method="text"/>
  <xsl:strip-space elements="*"/>

  <xsl:param name="delimiter">,</xsl:param>

  <xsl:template match="/PropertySet">
      <xsl:text>ProductId,Product,AttributeId,Attribute&#xa;</xsl:text>
      <xsl:apply-templates select="*"/>
  </xsl:template>

  <xsl:template match="PropertySet|Message|ListOf_Class_Def|ListOf_Prod_Def|ImpExp">
      <xsl:apply-templates select="*"/>
  </xsl:template>

  <xsl:template match="ListOfObject_Arrt">
    <xsl:apply-templates select="Object_Arrt"/>
    <xsl:if test="name(*) != 'Object_Arrt' and preceding-sibling::ListOfObject_Def/Object_Def/@Ancestor_Name = ''">
       <xsl:value-of select="concat(ancestor::ImpExp/@Object_Num, $delimiter,
                                    ancestor::ImpExp/@Name, $delimiter,
                                    '', $delimiter,
                                    '')"/><xsl:text>&#xa;</xsl:text>
    </xsl:if>
  </xsl:template>

  <xsl:template match="Object_Arrt">
    <xsl:variable name="attrName" select="ancestor::ImpExp/@Name"/>
    <xsl:value-of select="concat(/PropertySet/PropertySet/Message[@IntObjectName='Prod Def']/ListOf_Prod_Def/
                                 ImpExp[ListOfObject_Def/Object_Def/@Ancestor_Name = $attrName]/@Object_Num, $delimiter,

                                 /PropertySet/PropertySet/Message[@IntObjectName='Prod Def']/ListOf_Prod_Def/
                                 ImpExp[ListOfObject_Def/Object_Def/@Ancestor_Name = $attrName]/@Name, $delimiter,

                                 @Orig_Id, $delimiter,
                                 @Attr_Name)"/><xsl:text>&#xa;</xsl:text>
  </xsl:template>

</xsl:stylesheet>

Python

import lxml.etree as et

# LOAD XML AND XSL
xml = et.parse('Input.xml')
xsl = et.parse('XSLT_Script.xsl')

# RUN TRANSFORMATION
transform = et.XSLT(xsl)
result = transform(xml)

# OUTPUT TO FILE
with open('Output.csv', 'wb') as f:
    f.write(result)

Output

ProductId,Product,AttributeId,Attribute
2008a,Laptop,6666p,LP_Portable
2987d,Mouse,7010p,O_Portable
2987d,Mouse,7012j,O_wireless
5463g,Speaker,,

1 Comment

This is a much cleaner solution, thank you so much for taking the time to explain it. :)

You would need to preparse all of the CLASS_DEF entries into a dictionary. These can then be looked up when processing the PROD_DEF entries:

import csv
from lxml import etree

inFile = "./newm.xml"
outFile = "./new.csv"

tree = etree.parse(inFile)
class_defs = {}

# First extract all the CLASS_DEF entries into a dictionary
for impexp in tree.iter("ImpExp"):
    name = impexp.get('Name')

    if impexp.get('Type') == "CLASS_DEF":
        for list_of_object_arrt in impexp.findall('ListOfObject_Arrt'):
            class_defs[name] = [(obj.get('Orig_Id'), obj.get('Attr_Name')) for obj in list_of_object_arrt]

with open(outFile, 'wb') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(['ProductId', 'Product', 'AttributeId', 'Attribute'])

    for impexp in tree.iter("ImpExp"):
        object_num = impexp.get('Object_Num')
        name = impexp.get('Name')

        if impexp.get('Type') == "PROD_DEF":
            for list_of_object_def in impexp.findall('ListOfObject_Def'):
                for obj in list_of_object_def:
                    ancestor_num = obj.get('Ancestor_Num')
                    ancestor_name = obj.get('Ancestor_Name')

            csv_output.writerow([object_num, name] + list(class_defs.get(ancestor_name, [['', '']])[0]))

This would produce new.csv containing:

ProductId,Product,AttributeId,Attribute
2008a,Laptop,6666p,LP_Portable
2987d,Mouse,7010p,O_Portable
5463g,Speaker,,

If you are using Python 3.x, use:

with open(outFile, 'w', newline='') as f_output:    
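
For Python 3, a fuller sketch that writes one row per attribute (so a product whose class has several attributes, such as Mouse, gets several rows) might look like the following. It is only an illustration under a few assumptions: it uses the standard library's xml.etree.ElementTree rather than lxml, it matches products to classes on Ancestor_Num/Object_Num instead of on names, and the function names product_rows and write_csv are hypothetical, not part of the code above.

```python
import csv
import xml.etree.ElementTree as ET


def product_rows(root):
    """Yield (ProductId, Product, AttributeId, Attribute) tuples."""
    # Map each CLASS_DEF Object_Num to its list of (Orig_Id, Attr_Name) pairs.
    class_attrs = {}
    for impexp in root.iter("ImpExp"):
        if impexp.get("Type") == "CLASS_DEF":
            class_attrs[impexp.get("Object_Num")] = [
                (attr.get("Orig_Id"), attr.get("Attr_Name"))
                for attr in impexp.iter("Object_Arrt")
            ]

    for impexp in root.iter("ImpExp"):
        if impexp.get("Type") != "PROD_DEF":
            continue
        obj_def = impexp.find("ListOfObject_Def/Object_Def")
        ancestor = obj_def.get("Ancestor_Num", "") if obj_def is not None else ""
        # A product with no matching class still gets one row with empty fields.
        for orig_id, attr_name in class_attrs.get(ancestor) or [("", "")]:
            yield (impexp.get("Object_Num"), impexp.get("Name"), orig_id, attr_name)


def write_csv(xml_path, csv_path):
    """Parse xml_path and write the joined rows to csv_path."""
    root = ET.parse(xml_path).getroot()
    with open(csv_path, "w", newline="") as f_output:
        writer = csv.writer(f_output)
        writer.writerow(["ProductId", "Product", "AttributeId", "Attribute"])
        writer.writerows(product_rows(root))
```

Calling write_csv("./newm.xml", "./new.csv") would then write the header followed by one line per product/attribute pair. Keeping the row generation in a separate generator makes it easy to check the join without touching the file system.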

1 Comment

This is also a cleaner solution, but the output is quite different from what I wanted. Still, it helps me understand how to read a nested XML file and extract data from it. Thank you so much for taking the time to explain it. :)
