Dump data from <></> tags in xml to csv in python (multiple different xml stylesheets formats) [duplicate]

Question

Note: even after learning a little about XSLT, I didn't use it, because the metadata/XSL formats change between files, so a single stylesheet-based approach won't work.

I have been trying for the last few hours to grab an XML file and dump the data in each tag to a CSV, but nothing has worked. I have tried ElementTree, parse and regex, based on a few other Q&As on the forum.

For example, the following works fine for the other poster's test data, but it won't work on my XML (sample at end of question).

tree = ET.parse("test2.xml")
doc = tree.getroot()
thingy = doc.find('custod')
print thingy.attrib

Traceback (most recent call last): File "", line 1, in AttributeError: 'NoneType' object has no attribute 'attrib'

doc
<Element anzmeta at 801a300>
thingy = doc.find('anzmeta')
print thingy.attrib

Traceback (most recent call last): File "", line 1, in AttributeError: 'NoneType' object has no attribute 'attrib'

doc.attrib
{}

--- Try using regex

rex = re.compile(r'<custod.*?>(.*?)</custod>',re.S|re.M)
rex
<_sre.SRE_Pattern object at 0x080724A0>
match=rex.match('test2.xml')
match
text = match.groups()[0].strip()

Traceback (most recent call last): File "", line 1, in AttributeError: 'NoneType' object has no attribute 'groups'
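
A note on why this attempt fails, as far as can be told from the session above: `rex.match('test2.xml')` runs the pattern against the literal string 'test2.xml' rather than the contents of the file, and `match` only succeeds at the very start of a string, so it returns None. The sketch below uses an inline sample in place of the file (an assumption made to keep it self-contained) and `search` instead of `match`. It only illustrates the cause of the error; it is not a recommendation to parse XML with regex.

```python
import re

# Inline stand-in for the contents of test2.xml (assumption: a real script
# would read this text from the file instead).
xml_text = """<anzmeta>
  <citeinfo>
    <origin>
      <custod>ATGIS</custod>
    </origin>
  </citeinfo>
</anzmeta>"""

rex = re.compile(r'<custod.*?>(.*?)</custod>', re.S)

# match() is anchored at the start of the string, so it finds nothing here,
# and it finds nothing in the bare filename either.
assert rex.match(xml_text) is None
assert rex.match('test2.xml') is None

# search() scans the whole string for the first occurrence.
found = rex.search(xml_text)
custod_text = found.group(1).strip()
print(custod_text)
```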

All I need is for the system to go through my xml files and create a csv which has the complete entry of each tag in a column of the csv. It should add columns to the csv if they don't exist and then populate them accordingly.

=========== XML sample

<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type='text/xsl' href='ANZMeta.xsl'?>
<anzmeta>
  <citeinfo>
    <uniqueid />
    <title>&lt;&gt;</title>
    <origin>
      <custod>ATGIS</custod>
      <jurisdic>
        <keyword thesaurus="">Tablelands Regional Council</keyword>
      </jurisdic>
    </origin>
  </citeinfo>
  <descript>
    <abstract>&lt;&gt;
    </abstract>
    <theme>
      <keyword thesaurus="">EPSG</keyword>
    </theme>
    <spdom>
      <keyword thesaurus="">GDA94</keyword>
      <keyword thesaurus="">GRS80</keyword>
      <keyword thesaurus="">Map Grid of Australia</keyword>
      <keyword thesaurus="">Zone 55 (144E - 150E)</keyword>
      <bounding>
        <northbc />
        <southbc />
        <eastbc />
        <westbc />
      </bounding>
    </spdom>
  </descript>
  <timeperd>
    <begdate>
      <date>2012</date>
    </begdate>
    <enddate>
      <keyword thesaurus="">Completed</keyword>
    </enddate>
  </timeperd>
  <status>
    <progress>
      <keyword thesaurus="">Ongoing</keyword>
      <keyword thesaurus="">Completed</keyword>
    </progress>
    <update>
      <keyword thesaurus="">As Required</keyword>
      <keyword thesaurus="">As Required</keyword>
    </update>
  </status>
  <distinfo>
    <native>
      <nondig>
        <formname>File</formname>
      </nondig>
      <digform>
        <formname>Type:</formname>
      </digform>
    </native>
    <avlform>
      <nondig>
        <formname>Format:</formname>
      </nondig>
      <digform>
        <formname>Size</formname>
      </digform>
    </avlform>
    <accconst>Internal Use Only</accconst>
  </distinfo>
  <dataqual>
    <lineage>~TBC~</lineage>
    <procstep>
      <procdesc Sync="TUE">Metadata imported.</procdesc>
      <srcused Sync="TRUE">L:\Data_Admin\MetadataGenerator\trc_Metadata_Template.xml</srcused>
      <date Sync="TRUE">20121206</date>
      <time Sync="TRUE">15341400</time>
    </procstep>
    <posacc>~TBC~</posacc>
    <attracc>~TBC~</attracc>
    <logic>~TBC~</logic>
    <complete>~TBC~</complete>
  </dataqual>
  <cntinfo>
    <cntorg>Atherton Tablelands GIS</cntorg>
    <cntpos>GIS Coordinator</cntpos>
    <address>PO Box 1616, 8 Tolga Rd</address>
    <city>Atherton</city>
    <state>QLD</state>
    <country>AUSTRALIA</country>
    <postal>4883</postal>
    <cntvoice>07 40918600</cntvoice>
    <cntfax>07 40917035</cntfax>
    <cntemail>[email protected]</cntemail>
  </cntinfo>
  <metainfo>
    <metd>
      <date />
    </metd>
  </metainfo>
</anzmeta>

--- Start of my script

import os, xml, shutil, datetime
from xml.etree import ElementTree as et

SourceDIR = os.getcwd()
outDIR = os.path.join(os.getcwd(), 'out')

def locatexml(SourceDIR, outDIR):
    # Walk SourceDIR and collect every file with an .xml extension
    xmllist = []
    for root, dirs, files in os.walk(SourceDIR, topdown=False):
        for fl in files:
            currentFile = os.path.join(root, fl)
            ext = os.path.splitext(fl)[1]
            if ext == '.xml':
                xmllist.append(currentFile)
                print(currentFile)
                readxml(currentFile)
    print("finished")
    return xmllist

def readxml(currentFile):
    tree = et.parse(currentFile)
    print("Processing: " + str(currentFile))

# xmllist is local to locatexml, so keep the returned list before printing it
xmllist = locatexml(SourceDIR, outDIR)
print(xmllist)

This is a job for XSLT, NOT regex. Please read this SO answer for some context. Also, post a sample of the CSV you want output from the sample XML.
– Jim Garrison, Commented Mar 6, 2013 at 4:46
Thanks Jim... I was just going by a suggestion on SO. I won't look at it any further.
– GeorgeC, Commented Mar 6, 2013 at 5:02
ATOzTOA - the CSV will just have the tag as the column header and then entries (the contents of the tag) for each XML file which has the tag.
– GeorgeC, Commented Mar 6, 2013 at 5:04

Community · Accepted Answer · 2017-05-23 12:27:15Z

1

You should really use XSLT for this job, as it's a transformation of XML to another format. See the answer for this question for an example.

However, if you want to do it with lxml for some other reason, here is some code to get you started:

from lxml import etree

# Pass the filename so lxml honours the encoding declared in the XML prolog
tree = etree.parse('test.xml')

# At this point, we can step through the xml file
# and parse it, here is an example of the `cntinfo` tag

for element in tree.iter('cntinfo'):
    for child in element:  # iterating an element yields its direct children
        print("{0.tag}: {0.text}".format(child))

This will print:

cntorg: Atherton Tablelands GIS
cntpos: GIS Coordinator
address: PO Box 1616, 8 Tolga Rd
city: Atherton
state: QLD
country: AUSTRALIA
postal: 4883
cntvoice: 07 40918600
cntfax: 07 40917035
cntemail: [email protected]

You can similarly step through the other elements in your file; but I strongly recommend you use XSLT.
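
If there really is no single stylesheet that fits every file, stepping through all the elements can be generalised along these lines. This is only a minimal sketch, not a tested solution for the real metadata files: it uses the standard library's `xml.etree.ElementTree` (whose element API is close to lxml's), the helper names `flatten` and `write_csv` are made up for illustration, and it assumes that using each leaf tag's path as the column name and joining repeated tags with '; ' is acceptable.

```python
import csv
import io
import xml.etree.ElementTree as ET


def flatten(root):
    """Map each leaf element's path (e.g. 'cntinfo/city') to its text.

    Leaves that repeat under the same path are joined with '; '.
    Text that sits directly inside an element with children is ignored.
    """
    row = {}

    def walk(element, path):
        children = list(element)
        if not children:
            text = (element.text or '').strip()
            row[path] = row[path] + '; ' + text if path in row else text
        for child in children:
            walk(child, path + '/' + child.tag if path else child.tag)

    walk(root, '')
    return row


def write_csv(rows, out):
    """Write a list of dicts as CSV; the header is the union of all keys."""
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    writer = csv.DictWriter(out, fieldnames=columns, restval='')
    writer.writeheader()
    writer.writerows(rows)
    return columns


# Two small documents with different tags stand in for the real files.
doc_a = ET.fromstring(
    '<anzmeta><cntinfo><city>Atherton</city><state>QLD</state></cntinfo></anzmeta>')
doc_b = ET.fromstring(
    '<anzmeta><cntinfo><city>Atherton</city><postal>4883</postal></cntinfo></anzmeta>')

rows = [flatten(doc_a), flatten(doc_b)]
buffer = io.StringIO()
columns = write_csv(rows, buffer)
print(buffer.getvalue())
```

Each input document becomes one CSV row, and a tag that is missing from a document is left as an empty cell. For real files, `ET.parse(path).getroot()` would replace `ET.fromstring(...)`, and the output would go to a file opened with `newline=''` instead of the in-memory buffer.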

This snippet will transform the XML document to CSV using an XSLT stylesheet (from this question):

# First, we load the stylesheet
with open(r'd:\test.xsl') as f:
    temp = etree.parse(f)
    style_sheet = etree.XSLT(temp)

# Apply it to the previously parsed document tree:
converted_xml = style_sheet(tree)

# Print the results:
str(converted_xml)

This will give you:

'"",    "<>",    "ATGISTablelands Regional Council"\r"<>",    "EPSG",
  "GDA94GRS80Map Grid of AustraliaZone 55 (144E - 150E)"\r"2012",    "Completed"
\r"OngoingCompleted",    "As RequiredAs Required"\r"FileType:",    "Format:Size"
,    "Internal Use Only"\r"~TBC~",    "Metadata imported.L:\\Data_Admin\\Metadat
aGenerator\\trc_Metadata_Template.xml2012120615341400",    "~TBC~",    "~TBC~",
   "~TBC~",    "~TBC~"\r"Atherton Tablelands GIS",    "GIS Coordinator",    "PO
Box 1616, 8 Tolga Rd",    "Atherton",    "QLD",    "AUSTRALIA",    "4883",    "0
7 40918600",    "07 40917035",    "[email protected]"\r""\r'

edited May 23, 2017 at 12:27

answered Mar 6, 2013 at 5:28

Burhan Khalid

3 Comments

GeorgeC Over a year ago

Thanks. This is a great re-start for me. I guess the issue was that tags are part of parent groups and you can't get at the raw tags by themselves. Isn't lxml just an XSLT method for Python (which I want to use)?

Burhan Khalid Over a year ago

No, lxml provides Python bindings for the C libraries libxml2 and libxslt. You can use lxml to transform an XML document using XSLT. I have updated the answer to show how this can be done.

GeorgeC Over a year ago

Thanks. I went with the lxml route, as there is no standard stylesheet to use; each of the metadata files used as input is different.

John Zwinck · Accepted Answer · 2013-03-06 04:49:24Z

0

<anzmeta> is the root of your document, so you should be trying to find one of its direct children (like citeinfo), not the root tag name itself.
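
To illustrate with a cut-down copy of the sample (a sketch using the standard library's ElementTree, with the XML inlined rather than read from test2.xml): `find` with a bare tag name only looks at direct children, so both `doc.find('custod')` and `doc.find('anzmeta')` return None. A nested element needs either the full path from the root or a descendant search. Note also that `custod` has no attributes, so `.attrib` would be an empty dict even once the element is found; its content is in `.text`.

```python
import xml.etree.ElementTree as ET

# Cut-down version of the sample document from the question.
sample = """<anzmeta>
  <citeinfo>
    <origin>
      <custod>ATGIS</custod>
    </origin>
  </citeinfo>
</anzmeta>"""

doc = ET.fromstring(sample)  # doc is the <anzmeta> root element

# A bare tag name is only matched against direct children of doc.
assert doc.find('custod') is None       # custod is nested two levels down
assert doc.find('anzmeta') is None      # the root is not its own child
assert doc.find('citeinfo') is not None

# Either spell out the path from the root ...
custod = doc.find('citeinfo/origin/custod')
# ... or search all descendants.
same = doc.find('.//custod')
texts = [el.text for el in doc.iter('custod')]

print(custod.text, custod.attrib, texts)
```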

answered Mar 6, 2013 at 4:49

John Zwinck

1 Comment

GeorgeC Over a year ago

I tried but that didn't work either.
