Batch export xml files to csv using python

Question

I am new to Python, so please bear with my silly questions. I have multiple XML files in the following format, and I would like to extract certain tags from them and export the values to a single CSV file.

Here is an example of the XML (c:\xml\1.xml):

<?xml version='1.0' encoding='UTF-8'?>
<?xml-stylesheet type="text/xsl" href="emotionStyleSheet_template.xsl"?>
<EmotionReport>
    <VersionInformation>
        <Version>8.2.0</Version>
    </VersionInformation>
    <DateTime>
        <Date>18-10-2021</Date>
        <Time>14-12-26</Time>
    </DateTime>
    <SourceInformation>
        <File>
            <FilePath>//nas/emotionxml</FilePath>
            <FileName>file001.mxf</FileName>
            <FileSize>9972536969</FileSize>
            <FileAudioInformation>
                <AudioDuration>1345.0</AudioDuration>
                <SampleRate>48000</SampleRate>
                <NumChannels>8</NumChannels>
                <BitsPerSample>24</BitsPerSample>
                <AudioSampleGroups>64560000</AudioSampleGroups>
                <NumStreams>8</NumStreams>
                <Container>Undefined Sound</Container>
                <Description>IMC Nexio
</Description>
                <StreamInformation>
                    <Stream>
                        <StreamNumber>1</StreamNumber>
                        <NumChannelsInStream>1</NumChannelsInStream>
                        <Channel>
                            <ChannelNumber>1</ChannelNumber>
                            <ChannelEncoding>PCM</ChannelEncoding>
                        </Channel>
                    </Stream>
                    <Stream>
                        <StreamNumber>2</StreamNumber>
                        <NumChannelsInStream>1</NumChannelsInStream>
                        <Channel>
                            <ChannelNumber>1</ChannelNumber>
                            <ChannelEncoding>PCM</ChannelEncoding>
                        </Channel>
                    </Stream>
                </StreamInformation>
                <FileTimecodeInformation>
                    <FrameRate>25.00</FrameRate>
                    <DropFrame>false</DropFrame>
                    <StartTimecode>00:00:00:00</StartTimecode>
                </FileTimecodeInformation>
            </FileAudioInformation>
        </File>
    </SourceInformation>
</EmotionReport>

Expected output (EmotionData.csv):

,Date,Time,FileName,Description,FileSize,FilePath
0,18-10-2021,14-12-26,file001.mxf,IMC Nexio,9972536969,//nas/emotionxml
1,13-10-2021,08-12-26,file002.mxf,IMC Nexio,3566536770,//nas/emotionxml
2,03-10-2021,02-09-21,file003.mxf,IMC Nexio,46357672,//nas/emotionxml
....

Here is the code I've written based on what I've learned from online resources (emotion_xml_parser.py):

import xml.etree.ElementTree as ET
import glob2
import pandas as pd

cols = ["Date", "Time", "FileName", "Description", "FileSize", "FilePath"]
rows = []
for filename in glob2.glob(r'C:\xml\*.xml'):
  xmlData = ET.parse(filename)
  rootXML = xmlData.getroot()
  for i in rootXML:
    Date = i.findall("Date").text
    Time = i.findall("Time").text
    FileName = i.findall("FileName").text
    Description = i.findall("Description").text
    FileSize = i.findall("FileSize").text
    FilePath = i.findall("FilePath").text

    row.append({"Date": Date,
                "Time": Time,
                "FileName": FileName,
                "Description": Description,
                "FileSize": FileSize,
                "FilePath": FilePath,})
df = pd.DataFrame(rows,columns = cols)

# Write dataframe to csv
df.to_csv("EmotionData.csv")

I am receiving the following error when running the script:

  File "c:\emtion_xml_parser.py", line 14, in <module>
    Date = i.findall("Date").text
AttributeError: 'list' object has no attribute 'text'

TIA!

findall() returns a list of XML elements, so you need to pick one element from that list before accessing its text attribute. If you know there's only one Date tag, you can use i.find("Date").text instead of findall().
– rchome, Commented Nov 26, 2021 at 4:46
@rchome I tried using find() initially and got the following error: File "c:\emtion_xml_parser.py", line 13, in <module> Date = i.find("Date").text AttributeError: 'NoneType' object has no attribute 'text'. The tag names I am after are unique within each XML file.
– danh, Commented Nov 26, 2021 at 4:49
I see, so some files may not have a Date tag. Is that correct?
– rchome, Commented Nov 26, 2021 at 4:51
@rchome I made 3 copies of the example file, and I can confirm they all have those tags in them.
– danh, Commented Nov 26, 2021 at 4:58
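
The two errors discussed in these comments can be reproduced on a trimmed-down copy of the report. The sketch below is only an illustration: SAMPLE is a shortened stand-in for the real file, and the variable names are made up for this example. It shows that a bare tag name passed to find() or findall() matches direct children only, while a path such as "DateTime/Date" or ".//Date" reaches nested elements.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the report in the question (assumption: same nesting)
SAMPLE = """<EmotionReport>
    <DateTime>
        <Date>18-10-2021</Date>
        <Time>14-12-26</Time>
    </DateTime>
</EmotionReport>"""

root = ET.fromstring(SAMPLE)

# A bare tag name only matches direct children of the element it is called on.
direct = root.find("Date")            # None, because Date sits inside DateTime
all_direct = root.findall("Date")     # an empty list; lists have no .text

# A path relative to the element, or ".//" to search all descendants, works.
by_path = root.find("DateTime/Date")
anywhere = root.find(".//Date")

print(direct, all_direct, by_path.text, anywhere.text)
```

In the question's loop, i is each direct child of the root in turn, so i.find("Date") returns None as soon as i is an element such as VersionInformation that does not directly contain a Date tag.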

Martin Evans · Accepted Answer · 2021-11-27 09:10:18Z


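In your loop, i is each direct child of the root (VersionInformation, DateTime, SourceInformation), and find("Date") only looks at the direct children of i, so it returns None for any element that does not directly contain a Date tag. A better approach is to give the full path from the root to each element you need, for example: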

import xml.etree.ElementTree as ET
import glob2
import pandas as pd

cols = ["Date", "Time", "FileName", "Description", "FileSize", "FilePath"]
rows = []

for filename in glob2.glob(r'*.xml'):
    xmlData = ET.parse(filename)
    root = xmlData.getroot()
  
    row = {
        'Date' : root.findtext('DateTime/Date'),
        'Time' : root.findtext('DateTime/Time'),
        'FileName' : root.findtext('SourceInformation/File/FileName'),
        'Description' : root.findtext('SourceInformation/File/FileAudioInformation/Description').strip(),
        'FileSize' : root.findtext('SourceInformation/File/FileSize'),
        'FilePath' : root.findtext('SourceInformation/File/FilePath')
    }

    rows.append(row)

df = pd.DataFrame(rows, columns=cols)

# Write dataframe to csv
df.to_csv("EmotionData.csv")

Giving you:

,Date,Time,FileName,Description,FileSize,FilePath
0,18-10-2021,14-12-26,file001.mxf,IMC Nexio,9972536969,//nas/emotionxml
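
If some reports might lack one of these elements, findtext() returns None for it and the .strip() call on Description would raise an AttributeError. One possible variation, sketched here under the assumption that an empty string is an acceptable placeholder, passes default="" to findtext(). The FIELDS mapping and the extract_row helper are hypothetical names introduced for this example, and SAMPLE is a shortened stand-in that deliberately omits FileSize and FilePath.

```python
import xml.etree.ElementTree as ET

# Column name -> path relative to the <EmotionReport> root (same paths as above)
FIELDS = {
    "Date": "DateTime/Date",
    "Time": "DateTime/Time",
    "FileName": "SourceInformation/File/FileName",
    "Description": "SourceInformation/File/FileAudioInformation/Description",
    "FileSize": "SourceInformation/File/FileSize",
    "FilePath": "SourceInformation/File/FilePath",
}


def extract_row(root):
    """Hypothetical helper: build one CSV row, using "" for missing elements."""
    return {
        col: root.findtext(path, default="").strip()
        for col, path in FIELDS.items()
    }


# Shortened stand-in for a report that has no FileSize or FilePath element
SAMPLE = """<EmotionReport>
    <DateTime><Date>18-10-2021</Date><Time>14-12-26</Time></DateTime>
    <SourceInformation><File>
        <FileName>file001.mxf</FileName>
        <FileAudioInformation><Description>IMC Nexio
</Description></FileAudioInformation>
    </File></SourceInformation>
</EmotionReport>"""

row = extract_row(ET.fromstring(SAMPLE))
print(row)
```

Each row built this way can be appended to rows exactly as in the loop above. Note that .strip() is applied to every field here, not only Description.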

edited Nov 27, 2021 at 9:10

answered Nov 26, 2021 at 9:26

Martin Evans


3 Comments

danh Over a year ago

Great, this is working for me. Thank you. Could you please explain a little more about what "row = {}" does? Curly braces are used to define a dictionary in Python, but in this case it's empty?

Martin Evans Over a year ago

It creates an empty dictionary so that entries can be added to it in the following lines.

Martin Evans Over a year ago

You could also create all the entries in one go, but sometimes extra code is needed when extracting values.
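
The two styles mentioned in these comments produce the same dictionary. A minimal sketch, using a shortened stand-in document and made-up variable names (row_a, row_b):

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the report (assumption: same DateTime nesting)
root = ET.fromstring(
    "<EmotionReport><DateTime>"
    "<Date>18-10-2021</Date><Time>14-12-26</Time>"
    "</DateTime></EmotionReport>"
)

# Style 1: start with an empty dictionary and add entries one at a time.
# This leaves room for extra processing between the assignments.
row_a = {}
row_a["Date"] = root.findtext("DateTime/Date")
row_a["Time"] = root.findtext("DateTime/Time")

# Style 2: create all the entries in one go with a dictionary literal.
row_b = {
    "Date": root.findtext("DateTime/Date"),
    "Time": root.findtext("DateTime/Time"),
}

print(row_a == row_b)  # True
```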
