Converting csv to xml using Python

Question

I have a csv file which resembles the format below:

#===============================================================
#Type 1 Definition
#============================================================================
#TYPE, <name>
#Some tag for type------------------------------------------------------
#TYPESomeTag, <id>, <name>, <param>
#Another tag for type----------------------------------------------
#TYPEAnothertag, <param_1>, <param_2>, <param_3>
TYPE, Name_1
TYPESomeTag, 1, 2, 3
TYPESomeTag, 4, 2, 5
TYPEAnothertag, a, b, c

TYPE, Name_2
TYPESomeTag, 1, 2, 3
TYPESomeTag, 4, 2, 5
TYPEAnothertag, a, b, c

#===============================================================================
#Type 2 Definition
#===============================================================================
#TYPE2, <name>
#Some tag for type------------------------------------------------------
#TYPE2SomeTag, <id>, <name>, <param>
#Another tag for type----------------------------------------------
#TYPE2Anothertag, <param_1>, <param_2>, <param_3>
TYPE2, Name_1
TYPE2SomeTag, 1, 2, 3
TYPE2SomeTag, 4, 2, 5
TYPE2Anothertag, a, b, c

TYPE2, Name_2
TYPE2SomeTag, 1, 2, 3
TYPE2SomeTag, 4, 2, 5
TYPE2Anothertag, a, b, c

and so on...

My goal is to convert the above csv into xml, and I am using Python for this. Here is how I started implementing it:

for row in csv.reader(open(csvFile)):
    if row:  # skip blank lines
        if row[0] == 'TYPE':
            xmlData.write('      ' + '<TYPE' + row[1] + '>' + "\n")
        elif row[0] == 'TYPESomeTag':
            xmlData.write('      ' + '<TYPESomeTag' + row[2] + '>' + "\n")
        # elif ...:
        #     write some more tags
        # else:
        #     something else
xmlData.close()

This approach is pretty shabby since it's not easily extensible. I am comparing the first column of each row against a string. The problem arises when there is another set of type definitions such as TYPE2: then I have to write another set of if..else statements, which I don't think is an efficient way to do this.

Could someone advise how I can convert the above csv to xml in a better way?

EDIT:

This is the xml that I am aiming for:

<tags>
  <TYPE Name_1>
    <TYPESomeTag>
      <id>1</id>
      <name>2</name>
      <param>3</param>
    </TYPESomeTag>
    <TYPESomeTag>
      <id>4</id>
      <name>2</name>
      <param>5</param>
    </TYPESomeTag>
    <TYPEAnothertag>
      <param_1>a</param_1>
      <param_2>b</param_2>
      <param_3>c</param_3>
    </TYPEAnothertag>
  </TYPE>
  <TYPE2 Name_2>
    <TYPE2SomeTag>
      <id>1</id>
      <name>2</name>
      <param>3</param>
    </TYPE2SomeTag>
    <TYPE2SomeTag>
      <id>4</id>
      <name>2</name>
      <param>5</param>
    </TYPE2SomeTag>
    <TYPE2Anothertag>
      <param_1>a</param_1>
      <param_2>b</param_2>
      <param_3>c</param_3>
    </TYPE2Anothertag>
  </TYPE2>
</tags>

Based on different tags, are you writing different content? For example, you are writing <TYPE + row[1] and '<TYPESomeTag'+ row[2]. Different index based on different type.
– gaganso, Commented Jun 7, 2016 at 12:47
It's unclear what XML you want to produce. Could you provide the xml corresponding to the sample csv?
– Dmitrii Bulashevich, Commented Jun 7, 2016 at 12:48
Can you be certain that there is a blank line between each TYPEx? If so, you can use that as a flag to start/end your parsing of each xml outer block.
– joel goldstick, Commented Jun 7, 2016 at 13:59
It was a typo @joelgoldstick, there is no blank line between each TYPEx.
– smyslov, Commented Jun 7, 2016 at 14:10

Serge Ballesta · Accepted Answer · 2016-06-08 09:12:31Z

Well, that's a rather complex question:

- you first need to parse the definition comments to determine the tag names that will be used in the xml file
- you need to parse the csv file, skipping the comments and empty lines
- you need to build an XML file (or tree) with the following rules:
  - the root tag is tags
  - each line with only 2 elements is a top level tag
  - each line with more than 2 elements defines a subelement of the current top level element, where
    - the tag name is the first element of the line
    - every other element defines a subelement whose text is the value and whose tag name comes from the definition comments
- you want the xml file in pretty print mode
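
The tree-building rules above can be sketched on rows that have already been parsed. This is only a minimal illustration, not the code of this answer: `build_tree`, `rows` and `defs` are hypothetical names, and `defs` is assumed to already hold the tag names read from the definition comments.

```python
from xml.etree import ElementTree as ET

def build_tree(rows, defs):
    """Build a <tags> tree from parsed csv rows (hypothetical helper).

    defs maps the first field of a row to the tag names taken from the
    definition comments, e.g. {'TYPESomeTag': ['id', 'name', 'param']}.
    """
    root = ET.Element('tags')
    current = None
    for row in rows:
        fields = [f.strip() for f in row]
        if len(fields) == 2:
            # Two fields: open a new top level element; the second field
            # becomes an attribute named by the definition (default 'name').
            attr = defs.get(fields[0], ['name'])[0]
            current = ET.SubElement(root, fields[0], {attr: fields[1]})
        elif len(fields) > 2 and current is not None:
            # More fields: a subelement whose children are named by defs,
            # falling back to param_1, param_2, ... without a definition.
            fallback = ['param_%d' % (i + 1) for i in range(len(fields) - 1)]
            names = defs.get(fields[0]) or fallback
            sub = ET.SubElement(current, fields[0])
            for name, value in zip(names, fields[1:]):
                ET.SubElement(sub, name).text = value
    return root

defs = {'TYPE': ['name'], 'TYPESomeTag': ['id', 'name', 'param']}
rows = [['TYPE', ' Name_1'], ['TYPESomeTag', ' 1', ' 2', ' 3']]
xml_text = ET.tostring(build_tree(rows, defs), encoding='unicode')
print(xml_text)
```

Because the element names come from the data and from `defs`, a new family such as TYPE2 should need no extra branches.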

I would use:

- a dedicated filter to process the comment lines and remove them before feeding a csv reader
- a csv reader to parse the non-comment lines
- xml.etree.ElementTree to build the xml tree with the help of the tag names processed by the custom filter
- xml.dom.minidom to pretty print the xml
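
The first two items can be sketched as a generator that sits between the file and `csv.reader`. This is a simplified stand-in for the idea, under the assumption that definition comments look like `#TYPESomeTag, <id>, <name>, <param>`; the names `strip_comments`, `DEF_LINE` and `defs` are made up for the example.

```python
import csv
import io
import re

# Matches definition comments such as "#TYPESomeTag, <id>, <name>, <param>".
DEF_LINE = re.compile(r'#\s*(\w+)\s*((?:,\s*<\w+>)+)\s*$')

def strip_comments(lines, defs):
    """Yield only data lines; record tag definitions found in comments in defs."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith('#'):
            m = DEF_LINE.match(line)
            if m:
                # Keep the names between the angle brackets, in order.
                defs[m.group(1)] = re.findall(r'<(\w+)>', m.group(2))
            continue  # comments never reach the csv reader
        yield line

# io.StringIO stands in for an open csv file.
sample = io.StringIO(
    "#TYPE, <name>\n"
    "#Some tag for type------\n"
    "#TYPESomeTag, <id>, <name>, <param>\n"
    "TYPE, Name_1\n"
    "\n"
    "TYPESomeTag, 1, 2, 3\n"
)
defs = {}
rows = list(csv.reader(strip_comments(sample, defs), skipinitialspace=True))
print(defs)
print(rows)
```

Note that `defs` is filled lazily while the reader consumes the generator, so it is only complete once the rows have been read.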

This results in the following code:

import re
import csv
from xml.etree import ElementTree as ET
import xml.dom.minidom as minidom

class DefFilter:
    def __init__(self, fd, conf = None):
        # conf maps each tag to the names read from its definition comment
        self.conf = {} if conf is None else conf
        self.fd = fd
        self.line = re.compile(r'#\s*(\w+)\s*((?:,\s*\<\w+\>)+)')
        self.tagname = re.compile(r',\s*<(\w*)>((?:,\s*\<\w+\>)*)')
    def _parse_tags(self, line):
        l = []
        while True:
            m = self.tagname.match(line)
            #print('>', m.group(2), '<', sep='')
            l.append(m.group(1))
            if len(m.group(2)) == 0: return l
            line = m.group(2)
    def __iter__(self):
        return self
    def next(self):
        while True:
            line = next(self.fd).strip()
            if not line.startswith('#'): return line
            m = self.line.match(line)
            if m:
                self.conf[m.group(1)] = self._parse_tags(m.group(2))
    def __next__(self):
        return self.next()

class Parser:
    def __init__(self, conf = None):
        self.conf = conf
    def parse(self, fd):
        flt = DefFilter(fd, self.conf)
        rd = csv.reader(flt)
        root = ET.Element('tags')
        for row in rd:
            if len(row) ==2:
                name = 'name'
                tag = row[0].strip()
                try:
                    name = flt.conf[tag][0]
                except (KeyError, IndexError):  # no definition: keep default
                    pass
                elt = ET.SubElement(root, tag, { name: row[1].strip() })
            elif len(row) > 2:
                tag = row[0].strip()
                x = ET.SubElement(elt, tag)
                tags = [ 'param_' + str(i+1) for i in range(len(row) - 1)]
                try:
                    tags = flt.conf[tag]
                except KeyError:  # no definition: keep generic param_N names
                    pass
                for i, val in enumerate(row[1:]):
                    y = ET.SubElement(x, tags[i])
                    y.text = val.strip()
        self.root = root
    def parsefile(self, filename):
        with open(filename) as fd:
            self.parse(fd)
    def prettyprint(self, fd, addindent = '  ', newl = '\n'):
        # use self.root (not a global parser instance) so any Parser works
        minidom.parseString(ET.tostring(self.root)).writexml(fd, newl = newl,
                                                             addindent=addindent)

You can then use:

# 'in' is a reserved word in Python, so the file objects need other names
with open('in.csv') as src, open('out.xml', 'w') as out:
    p = Parser()
    p.parse(src)
    p.prettyprint(out)
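
To see the pretty-print step in isolation, here is a small sketch on a hand-built tree. The tree contents and the `io.StringIO` buffer (standing in for the output file) are assumptions for the example; the `minidom` calls are the same ones used in `prettyprint`.

```python
import io
import xml.dom.minidom as minidom
from xml.etree import ElementTree as ET

# A tiny tree built by hand, only to have something to print.
root = ET.Element('tags')
elt = ET.SubElement(root, 'TYPE', {'name': 'Name_1'})
ET.SubElement(ET.SubElement(elt, 'TYPESomeTag'), 'id').text = '1'

out = io.StringIO()  # stands in for the open output file
# ElementTree serializes the tree, minidom re-parses it and writes it indented.
minidom.parseString(ET.tostring(root)).writexml(out, addindent='  ', newl='\n')
pretty = out.getvalue()
print(pretty)
```

On Python 3.9 and later, `ET.indent(root)` followed by `ET.tostring(root)` is another option that avoids the round trip through minidom.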

Parfait · Accepted Answer · 2016-06-08 02:58:46Z

Consider using an xml module to build the xml document rather than concatenating string representations of elements. That way, you can read the csv line by line, conditionally adding child elements and text values according to line position. The code below adds generic <tag> elements for the grandchildren:

import csv
import lxml.etree as ET

# INITIATE TREE
root = ET.Element('tags')

# READ CSV LINE BY LINE 
cnt = 0; strtype = ''
with open('Type1.csv', 'r') as f:
    csvr = csv.reader(f)
    for line in csvr:
        # CONDITIONALLY ADD CHILDREN ATTRIB OR ELEMENTS
        if len(line) > 1:                
            if cnt==0 or line[0] == strtype:
                strtype = line[0]
                typeNode = ET.SubElement(root, strtype.strip())
                typeNode.set('attr', line[1].strip())

            if cnt >= 1:                
                typesomeNode = ET.SubElement(typeNode, line[0].strip())            
                ET.SubElement(typesomeNode, 'tag').text = line[1].strip()            
                ET.SubElement(typesomeNode, 'tag').text = line[2].strip()            
                ET.SubElement(typesomeNode, 'tag').text = line[3].strip()                    
        else:
            cnt = 0
            continue            
        cnt += 1

# CONVERT TREE TO STRING W/ INDENTATION
tree_out = ET.tostring(root, pretty_print=True)
print(tree_out.decode("utf-8"))

To replace the generic tags with <id>, <name>, <param>, <param_1>, etc., consider XSLT (the transformation language used to restructure xml documents). Python's lxml module can run XSLT 1.0 scripts. This is one approach that avoids the many conditionals on the first read above:

xslt_str = '''
            <xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
            <xsl:output version="1.0" encoding="UTF-8" indent="yes" />
            <xsl:strip-space elements="*"/>

              <!-- Identity Transform -->
              <xsl:template match="@*|node()">
                <xsl:copy>
                  <xsl:apply-templates select="@*|node()"/>
                </xsl:copy>
              </xsl:template>

              <xsl:template match="TYPESomeTag|TYPE2SomeTag">
                <xsl:copy>
                  <id><xsl:value-of select="tag[1]"/></id>
                  <name><xsl:value-of select="tag[2]"/></name>
                  <param><xsl:value-of select="tag[3]"/></param>
                </xsl:copy>
              </xsl:template>

              <xsl:template match="TYPEAnothertag|TYPE2Anothertag">
                <xsl:copy>
                  <param_1><xsl:value-of select="tag[1]"/></param_1>
                  <param_2><xsl:value-of select="tag[2]"/></param_2>
                  <param_3><xsl:value-of select="tag[3]"/></param_3>
                </xsl:copy>
              </xsl:template>                    
            </xsl:transform>
'''    
# PARSE XSL STRING (CAN ALSO READ FROM FILE)
xslt = ET.fromstring(xslt_str)
# TRANSFORM SOURCE XML WITH XSLT
transform = ET.XSLT(xslt)
newdom = transform(root)    
print(str(newdom))

Output (for TYPE1 but also similar for TYPE2)

<?xml version="1.0"?>
<tags>
  <TYPE attr="Name_1">
    <TYPESomeTag>
      <id>1</id>
      <name>2</name>
      <param>3</param>
    </TYPESomeTag>
    <TYPESomeTag>
      <id>4</id>
      <name>2</name>
      <param>5</param>
    </TYPESomeTag>
    <TYPEAnothertag>
      <param_1>a</param_1>
      <param_2>b</param_2>
      <param_3>c</param_3>
    </TYPEAnothertag>
  </TYPE>
  <TYPE attr="Name_2">
    <TYPESomeTag>
      <id>1</id>
      <name>2</name>
      <param>3</param>
    </TYPESomeTag>
    <TYPESomeTag>
      <id>4</id>
      <name>2</name>
      <param>5</param>
    </TYPESomeTag>
    <TYPEAnothertag>
      <param_1>a</param_1>
      <param_2>b</param_2>
      <param_3>c</param_3>
    </TYPEAnothertag>
  </TYPE>
</tags>

XSLT is a nice technology for applying tags, but the XSLT has to be built either by parsing the comments of the input csv or by hand-coding it. So your answer is more elegant than mine, but it needs to be enhanced with a comments parser.
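
One possible way to close that gap, sketched under assumptions: generate the stylesheet from a dictionary of definitions instead of hand-coding it. `make_xslt` and `defs` are hypothetical names, and `defs` is assumed to contain only the sub-element definitions (not the top level TYPE entries, whose children must be preserved by the identity template).

```python
from xml.etree import ElementTree as ET

XSL_NS = 'http://www.w3.org/1999/XSL/Transform'

def make_xslt(defs):
    """Return an XSLT 1.0 stylesheet string that renames generic <tag> children.

    defs maps an element name to its child names, e.g.
    {'TYPESomeTag': ['id', 'name', 'param']}.
    """
    parts = [
        '<xsl:transform xmlns:xsl="%s" version="1.0">' % XSL_NS,
        '<xsl:output version="1.0" encoding="UTF-8" indent="yes"/>',
        '<xsl:strip-space elements="*"/>',
        # Identity transform: copy everything not matched below.
        '<xsl:template match="@*|node()"><xsl:copy>'
        '<xsl:apply-templates select="@*|node()"/></xsl:copy></xsl:template>',
    ]
    for element, names in defs.items():
        # One template per defined element: the n-th <tag> child becomes
        # an element with the n-th defined name.
        parts.append('<xsl:template match="%s"><xsl:copy>' % element)
        for pos, name in enumerate(names, start=1):
            parts.append('<%s><xsl:value-of select="tag[%d]"/></%s>'
                         % (name, pos, name))
        parts.append('</xsl:copy></xsl:template>')
    parts.append('</xsl:transform>')
    return '\n'.join(parts)

xslt_str = make_xslt({'TYPESomeTag': ['id', 'name', 'param']})
# Sanity check with the standard library: the stylesheet is well-formed XML.
stylesheet = ET.fromstring(xslt_str)
print(xslt_str)
```

The resulting `xslt_str` could then be parsed and applied with lxml in the same way as the hand-written stylesheet in the answer.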

Dmitrii Bulashevich · Accepted Answer · 2016-06-07 16:03:33Z

You need to store the parameters from each commented line in a dictionary, i.e. to process

#TYPESomeTag, id, name, param

into

tags = {"TYPESomeTag":["id", "name", "param"]}

This way you can parse every comment line without hand-coding the parameter lists. Below is sample code to process your given csv.

import csv

csvFile = 'sample.csv'

nextLineIsTagName = False
tags = dict()
tag = None
tagOpened = False

for row in csv.reader(open(csvFile), skipinitialspace=True):
    if not row: #skipping empty lines
        continue

    if row[0][0] == '#': # processing type definitions within a csv comment block
        if tagOpened: # there is an open tag, so we need to close it
            print("</" + tag + ">")
            tags = dict()
            tag = None
            tagOpened = False

        if (len(row) == 1) and 'Definition' in row[0]:
            nextLineIsTagName = True
            continue

        if nextLineIsTagName and len(row) == 2:
            tag = row[0][1:]
            nextLineIsTagName = False
            continue

        if not nextLineIsTagName and len(row) > 1:
            # adding 'parameters' to the 'tags' dict entry; strip('<>') drops
            # the angle brackets used in the sample csv definitions
            tags[row[0][1:]] = [p.strip('<>') for p in row[1:]]

    else: # processing csv data
        if len(row) < 2:
            continue

        if row[0] == tag: # we need to start a new TYPE element
            if tagOpened: # close the previous tag before opening a new one
                print("</" + tag + ">")

            print("<" + tag, row[1] + ">")
            tagOpened = True
        else: # we need to add parameters to the open TYPE element
            print("\t<" + row[0] + ">")
            for i in range(1, len(row)): # iterating over parameters
                print("\t\t<" + tags[row[0]][i-1] + ">" + row[i] + "</" + tags[row[0]][i-1] + ">")
            print("\t</" + row[0] + ">")

if tagOpened: # closing the last tag at end of file
    print("</" + tag + ">")
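
The dictionary idea can also be shown in isolation. The following is a small sketch, not part of the code above: `parse_definition` and `render_row` are hypothetical helpers, and escaping the values with `xml.sax.saxutils.escape` is an addition that the original code does not perform.

```python
from xml.sax.saxutils import escape

def parse_definition(line):
    """Split '#TYPESomeTag, <id>, <name>, <param>' into a tag and its parameter names."""
    fields = [f.strip() for f in line.lstrip('#').split(',')]
    return fields[0], [f.strip('<>') for f in fields[1:]]

def render_row(row, tags):
    """Return the xml lines for one data row, naming children via the tags dict."""
    lines = ['\t<%s>' % row[0]]
    for name, value in zip(tags[row[0]], row[1:]):
        # escape() protects characters such as & and < inside the values
        lines.append('\t\t<%s>%s</%s>' % (name, escape(value), name))
    lines.append('\t</%s>' % row[0])
    return lines

tags = {}
key, params = parse_definition('#TYPESomeTag, <id>, <name>, <param>')
tags[key] = params
rendered = render_row(['TYPESomeTag', '1', '2', '3'], tags)
print('\n'.join(rendered))
```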
