I am new to Python, so please bear with me if this is a silly question. I have multiple XML files in the following format, and I would like to extract certain tags from each of them and export the values to a single CSV file.

Here is an example of the xml (c:\xml\1.xml)

<?xml version='1.0' encoding='UTF-8'?>
<?xml-stylesheet type="text/xsl" href="emotionStyleSheet_template.xsl"?>
<EmotionReport>
    <VersionInformation>
        <Version>8.2.0</Version>
    </VersionInformation>
    <DateTime>
        <Date>18-10-2021</Date>
        <Time>14-12-26</Time>
    </DateTime>
    <SourceInformation>
        <File>
            <FilePath>//nas/emotionxml</FilePath>
            <FileName>file001.mxf</FileName>
            <FileSize>9972536969</FileSize>
            <FileAudioInformation>
                <AudioDuration>1345.0</AudioDuration>
                <SampleRate>48000</SampleRate>
                <NumChannels>8</NumChannels>
                <BitsPerSample>24</BitsPerSample>
                <AudioSampleGroups>64560000</AudioSampleGroups>
                <NumStreams>8</NumStreams>
                <Container>Undefined Sound</Container>
                <Description>IMC Nexio
</Description>
                <StreamInformation>
                    <Stream>
                        <StreamNumber>1</StreamNumber>
                        <NumChannelsInStream>1</NumChannelsInStream>
                        <Channel>
                            <ChannelNumber>1</ChannelNumber>
                            <ChannelEncoding>PCM</ChannelEncoding>
                        </Channel>
                    </Stream>
                    <Stream>
                        <StreamNumber>2</StreamNumber>
                        <NumChannelsInStream>1</NumChannelsInStream>
                        <Channel>
                            <ChannelNumber>1</ChannelNumber>
                            <ChannelEncoding>PCM</ChannelEncoding>
                        </Channel>
                    </Stream>
                </StreamInformation>
                <FileTimecodeInformation>
                    <FrameRate>25.00</FrameRate>
                    <DropFrame>false</DropFrame>
                    <StartTimecode>00:00:00:00</StartTimecode>
                </FileTimecodeInformation>
            </FileAudioInformation>
        </File>
    </SourceInformation>
</EmotionReport>

Expected output (EmotionData.csv):

,Date,Time,FileName,Description,FileSize,FilePath
0,18-10-2021,14-12-26,file001.mxf,IMC Nexio,9972536969,//nas/emotionxml
1,13-10-2021,08-12-26,file002.mxf,IMC Nexio,3566536770,//nas/emotionxml
2,03-10-2021,02-09-21,file003.mxf,IMC Nexio,46357672,//nas/emotionxml
....

Here is the code I've written based on what I've learned from online resources (emotion_xml_parser.py):

import xml.etree.ElementTree as ET
import glob2
import pandas as pd

cols = ["Date", "Time", "FileName", "Description", "FileSize", "FilePath"]
rows = []
for filename in glob2.glob(r'C:\xml\*.xml'):
  xmlData = ET.parse(filename)
  rootXML = xmlData.getroot()
  for i in rootXML:
    Date = i.findall("Date").text
    Time = i.findall("Time").text
    FileName = i.findall("FileName").text
    Description = i.findall("Description").text
    FileSize = i.findall("FileSize").text
    FilePath = i.findall("FilePath").text

    row.append({"Date": Date,
                "Time": Time,
                "FileName": FileName,
                "Description": Description,
                "FileSize": FileSize,
                "FilePath": FilePath,})
df = pd.DataFrame(rows,columns = cols)

# Write dataframe to csv
df.to_csv("EmotionData.csv")

I get the following error when I run the script:

  File "c:\emtion_xml_parser.py", line 14, in <module>
    Date = i.findall("Date").text
AttributeError: 'list' object has no attribute 'text'

TIA!

  • findall() returns a list of XML elements, so you need to pick one element from that list before accessing its text attribute. If you know there is only one Date tag, you can use i.find("Date").text instead of findall(). Commented Nov 26, 2021 at 4:46
  • @rchome I tried find() initially and got the following error: File "c:\emtion_xml_parser.py", line 13, in <module> Date = i.find("Date").text AttributeError: 'NoneType' object has no attribute 'text'. The tag names I am after are unique within the XML. Commented Nov 26, 2021 at 4:49
  • I see, so some files may not have a Date tag. Is that correct? Commented Nov 26, 2021 at 4:51
  • @rchome I made 3 copies of the example file, so I can confirm that they all contain those tags. Commented Nov 26, 2021 at 4:58
  • Have you tried BeautifulSoup? Commented Nov 26, 2021 at 6:20
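
To make the two errors discussed in these comments concrete, here is a minimal sketch using a trimmed-down copy of the report (the SAMPLE string is an illustrative stand-in, not the full file). It shows that findall() returns a list and that find() only searches direct children, which is one plausible reason find("Date") came back as None inside the loop over the root's children.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the report in the question (not the full file).
SAMPLE = """<EmotionReport>
    <VersionInformation><Version>8.2.0</Version></VersionInformation>
    <DateTime><Date>18-10-2021</Date><Time>14-12-26</Time></DateTime>
</EmotionReport>"""

root = ET.fromstring(SAMPLE)

# findall() always returns a list, and a list has no .text attribute.
dates_under_root = root.findall("Date")

# find() only looks at direct children. Date is nested inside DateTime,
# so searching from the root or from VersionInformation returns None.
date_from_root = root.find("Date")
date_from_version = root.find("VersionInformation").find("Date")

# Searching from the DateTime element itself does find the tag.
date_from_datetime = root.find("DateTime").find("Date")

print(dates_under_root)         # empty list
print(date_from_root)           # None
print(date_from_version)        # None
print(date_from_datetime.text)  # 18-10-2021
```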

1 Answer

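Your loop iterates over the root's direct children (VersionInformation, DateTime, SourceInformation), and find() only searches an element's direct children, so i.find("Date") returns None for any child that does not directly contain a Date tag. A better approach is to drop the loop and give the full path, relative to the root, to each element you need, for example: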

import xml.etree.ElementTree as ET
import glob2
import pandas as pd

cols = ["Date", "Time", "FileName", "Description", "FileSize", "FilePath"]
rows = []

for filename in glob2.glob(r'*.xml'):
    xmlData = ET.parse(filename)
    root = xmlData.getroot()
  
    row = {
        'Date' : root.findtext('DateTime/Date'),
        'Time' : root.findtext('DateTime/Time'),
        'FileName' : root.findtext('SourceInformation/File/FileName'),
        # default='' keeps .strip() from failing if a file has no Description tag
        'Description' : root.findtext('SourceInformation/File/FileAudioInformation/Description', default='').strip(),
        'FileSize' : root.findtext('SourceInformation/File/FileSize'),
        'FilePath' : root.findtext('SourceInformation/File/FilePath')
    }

    rows.append(row)

df = pd.DataFrame(rows, columns=cols)

# Write dataframe to csv
df.to_csv("EmotionData.csv")        

Giving you:

,Date,Time,FileName,Description,FileSize,FilePath
0,18-10-2021,14-12-26,file001.mxf,IMC Nexio,9972536969,//nas/emotionxml
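
If the tag names really are unique within each file, as stated in the question, a descendant search with ".//" is a possible alternative to spelling out every path. The sketch below is only an illustration under that assumption: extract_row is a hypothetical helper name, and SAMPLE is a shortened stand-in for one report so the snippet runs on its own.

```python
import xml.etree.ElementTree as ET

COLS = ["Date", "Time", "FileName", "Description", "FileSize", "FilePath"]


def extract_row(root):
    """Hypothetical helper: look up each wanted tag anywhere below root."""
    row = {}
    for tag in COLS:
        # ".//tag" matches the first descendant with that name at any depth.
        # default="" avoids calling .strip() on None when a tag is absent.
        row[tag] = root.findtext(f".//{tag}", default="").strip()
    return row


# Shortened stand-in for one report file (assumption: tag names are unique).
SAMPLE = """<EmotionReport>
    <DateTime><Date>18-10-2021</Date><Time>14-12-26</Time></DateTime>
    <SourceInformation><File>
        <FilePath>//nas/emotionxml</FilePath>
        <FileName>file001.mxf</FileName>
        <FileSize>9972536969</FileSize>
        <FileAudioInformation><Description>IMC Nexio
</Description></FileAudioInformation>
    </File></SourceInformation>
</EmotionReport>"""

row = extract_row(ET.fromstring(SAMPLE))
print(row)
```

The trade-off is that a descendant search silently picks the first match if a tag name ever appears more than once, whereas the full paths above fail in a more predictable way.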

3 Comments

Great, this is working for me. Thank you. Could you please explain a little more about what "row = {}" does? Curly braces define a dictionary in Python, but in this case it's empty?
It creates an empty dictionary so that entries can be added to it in the lines that follow.
You could also create all the entries in one go, but sometimes extra code is needed when extracting values.
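
The two styles mentioned in these comments can be compared in a short sketch. The values are copied from the sample output above purely for illustration.

```python
# Style 1: start with an empty dictionary and add keys one at a time.
# Useful when a value needs extra processing before it is stored.
row = {}
row["Date"] = "18-10-2021"
row["Time"] = "14-12-26"

# Style 2: create the same dictionary in one go with a literal.
row_literal = {"Date": "18-10-2021", "Time": "14-12-26"}

print(row == row_literal)  # True
```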
