0

I have hundreds of XML files I need to convert to XLSX for downstream analysis, but I have far too many to do by hand. I have been trying to figure out a python way to do this, and have found a number of tutorials using either xml2xlsx or xml.etree.ElementTree. However, the problem I am running into is that all of these are dealing with singular sheets, whereas my xml has 3 sheets (see below sections for example). If someone knows how to do this, or how I should be going about parsing it apart, I would appreciate any help.

<?mso-application progid="Excel.Sheet"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:o="urn:schemas-microsoft-com:office:office"
 xmlns:x="urn:schemas-microsoft-com:office:excel"
 xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:html="http://www.w3.org/TR/REC-html40">
 <ExcelWorkbook xmlns="urn:schemas-microsoft-com:office:excel">
  <ProtectStructure>False</ProtectStructure>
  <ProtectWindows>False</ProtectWindows>
 </ExcelWorkbook>
 <Worksheet ss:Name="Nodes">
  <Table>
   <Row>
    <Cell><Data ss:Type="String">Node ID</Data></Cell>
    <Cell><Data ss:Type="String">X Coord</Data></Cell>
    <Cell><Data ss:Type="String">Y Coord</Data></Cell>
    <Cell><Data ss:Type="String">Z Coord</Data></Cell>
    <Cell><Data ss:Type="String">Coordination Number</Data></Cell>
   </Row>
   <Row>
    <Cell><Data ss:Type="Number">0</Data></Cell>
    <Cell><Data ss:Type="Number">12.25489235</Data></Cell>
    <Cell><Data ss:Type="Number">21.835989</Data></Cell>
    <Cell><Data ss:Type="Number">6.916931152</Data></Cell>
    <Cell><Data ss:Type="Number">1</Data></Cell>
   </Row>
...
  </Table>
  <WorksheetOptions xmlns="urn:schemas-microsoft-com:office:excel">
   <ProtectObjects>False</ProtectObjects>
   <ProtectScenarios>False</ProtectScenarios>
  </WorksheetOptions>
 </Worksheet>
 <Worksheet ss:Name="Points">
  <Table>
   <Row>
    <Cell><Data ss:Type="String">Point ID</Data></Cell>
    <Cell><Data ss:Type="String">thickness</Data></Cell>
    <Cell><Data ss:Type="String">X Coord</Data></Cell>
    <Cell><Data ss:Type="String">Y Coord</Data></Cell>
    <Cell><Data ss:Type="String">Z Coord</Data></Cell>
   </Row>
...
  </Table>
  <WorksheetOptions xmlns="urn:schemas-microsoft-com:office:excel">
   <ProtectObjects>False</ProtectObjects>
   <ProtectScenarios>False</ProtectScenarios>
  </WorksheetOptions>
 </Worksheet>
 <Worksheet ss:Name="Segments">
  <Table>
   <Row>
    <Cell><Data ss:Type="String">Segment ID</Data></Cell>
    <Cell><Data ss:Type="String">Node ID #1</Data></Cell>
    <Cell><Data ss:Type="String">Node ID #2</Data></Cell>
    <Cell><Data ss:Type="String">Point IDs</Data></Cell>
   </Row>
...

Here is what I have attempted so far, First using xml2xlsx, which returns an empty excel sheet:

from xml2xlsx import xml2xlsx
from lxml import etree
import sys
import os
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('-f', '--files', nargs = '+', required = True, help='one or more xml files to convert')
#parser.add_argument('-c', '--conditions', nargs = '+', default = ['Condition 1', 'Condition 2'], help = 'Names of conditions to analyze. replace any spaces in filename with underscores.')
args = parser.parse_args()

for file in args.files:
    filename = os.path.splitext(os.path.basename(file))[0].lower()
    filename = filename.replace(' ', '_')
    output = filename+'.xlsx'
    with open(file) as xml_file:
        template = xml_file.read().encode()
    with open(output, 'wb') as xlsx_file:
        xlsx_file.write(xml2xlsx(template))

Then this prewritten code, I am having trouble getting to work.

#XML TO EXCEL FILE
import xml.etree.ElementTree as ET
from openpyxl import Workbook
import os 
import sys
import argparse

def readFile(filename):
    '''
        Checks if file exists, parses the file and extracts the needed data
        returns a 2 dimensional list without "header"
    '''
    if not os.path.exists(filename): return
    tree = ET.parse(filename)
    root = tree.getroot()
    #you may need to adjust the keys based on your file structure
    dict_keys = ["id","first_name","last_name","email","gender","ip_address" ] #all keys to be extracted from xml
    mdlist = []
    for child in root:
        temp = []
        for key in dict_keys:
            temp.append(child.find(key).text)
        mdlist.append(temp)
    return mdlist

def to_Excel(mdlist):
    '''
        Generates excel file with given data
        mdlist: 2 Dimenusional list containing data
    '''
    wb = Workbook()
    ws = wb.active
    for i,row in enumerate(mdlist):
        for j,value in enumerate(row):
            ws.cell(row=i+1, column=j+1).value = value
    newfilename = os.path.abspath("./xml_to_excel.xlsx")
    wb.save(newfilename)
    print("complete")
    return

parser = argparse.ArgumentParser()
parser.add_argument('-f', '--files', nargs = '+', required = True, help='one or more xml files to convert')
#parser.add_argument('-c', '--conditions', nargs = '+', default = ['Condition 1', 'Condition 2'], help = 'Names of conditions to analyze. replace any spaces in filename with underscores.')
args = parser.parse_args()

for f in args.files:
    result = readFile(f)
    if result:
        to_Excel(result)

And finally, this, which spat out a long single line of nonsense tags.

import xml.etree.ElementTree as ET
tree = ET.parse(r'pathToXML')
root = tree.getroot()
tag = root.tag
att = root.attrib
#Flatten XML to CSV
for child in root:
    for subchild in child:
        mainlevel = child.tag
        xmltocsv = ''
        for elem in root.iter():
            if elem.tag == root.tag:
                continue
            if elem.tag == mainlevel:
                xmltocsv = xmltocsv + '\n'
            xmltocsv = xmltocsv + str(elem.tag).rstrip() + str(elem.attrib).rstrip() + ';' + str(elem.text).rstrip() + ';'
with open('output.csv', 'w') as file:
    file.write(xmltocsv)
4
  • 1
    This isn't a code writing service. If you want help, you should at least make an honest attempt at a solution. Commented Oct 19, 2020 at 16:18
  • I added what I have tried so far. The main problem is I am very unfamiliar with XML files, so I thought showing everything I have done so far would just be more confusing than getting unbiased help. Commented Oct 19, 2020 at 16:46
  • Can you use MS Excel the application to open such .xml files? Commented Oct 19, 2020 at 19:11
  • Yea I ended up just converting them manually by opening them with excel and saving as .xlsx. I was hoping to automate it because there were 270 files but it ended up being faster to do it manually than try to figure it out this way. Commented Oct 19, 2020 at 19:16

1 Answer 1

3

Try this.

from openpyxl import Workbook
from simplified_scrapy import SimplifiedDoc, utils

def readFile(filename):
    xml = utils.getFileContent(filename)
    doc = SimplifiedDoc(xml)
    tables = doc.selects('Worksheet').selects('Row').selects('Cell').text # Get all data
    sheetNames = doc.selects('Worksheet>ss:Name()') # Get sheet name
    return sheetNames,tables

def to_Excel(sheetNames,tables):
    print(tables) # Test
    wb = Workbook() # Create Workbook
    for i in range(len(sheetNames)):
        worksheet = wb.create_sheet(sheetNames[i]) # Create sheet
        for row in tables[i]:
            worksheet.append(row)
    wb.save('xml_to_excel.xlsx') # Save file

to_Excel(*readFile("pathToXML"))
Sign up to request clarification or add additional context in comments.

2 Comments

Thanks a lot. I tested the code but produced blank xlsx file with no data at all
@YasserKhalil I added a line of code: print(tables). You can test it to see what is printed out.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.