Convert dynamic XML file to CSV file - Python

Question

I would like to convert this XML file:

<record id="idOne">
    <ts date="2019-07-03" time="15:28:41.720440">5</ts>
    <ts date="2019-07-03" time="15:28:42.629959">10</ts>
    <ts date="2019-07-03" time="15:28:43.552677">15</ts>
    <ts date="2019-07-03" time="15:28:43.855345">20</ts>
</record>

<record id="idOne">
    <ts date="2019-07-03" time="15:28:45.072922">30</ts>
    <ts date="2019-07-03" time="15:28:45.377087">35</ts>
    <ts date="2019-07-03" time="15:28:46.316321">40</ts>
    <ts date="2019-07-03" time="15:28:47.527960">45</ts>
</record>

to this CSV file:

ID, date, time, value
idOne, 2019-07-03, 15:28:41.720440, 5
idOne, 2019-07-03, 15:28:42.629959, 10
idOne, 2019-07-03, 15:28:43.552677, 15
idOne, 2019-07-03, 15:28:43.855345, 20
idOne, 2019-07-03, 15:28:45.072922, 30
idOne, 2019-07-03, 15:28:45.377087, 35
idOne, 2019-07-03, 15:28:46.316321, 40
idOne, 2019-07-03, 15:28:47.527960, 45

The file can contain several record blocks, each with its own ID.

I use the lxml library.

I tried the xpath method with for loops, but I only get the ID and not the rest. The problem is in the second for loop: I don't know how to read the values of "date" and "time".

with open(args.input, "r") as f:
    # add root tags so the XML file can be parsed
    records = itertools.chain('<root>', f, '</root>')
    root = etree.fromstringlist(records)

    #root = etree.fromstring(records)
    # count the number of records
    NumberRecords = int(root.xpath('count(//record)'))

    RecordsGrid = [[] for __ in range(NumberRecords)]
    tss = ["id","date", "time", "value"]
    paths = root.xpath('//record')
    #print(paths)
    Counter = 0
    for path in paths:

        for ts in tss[:1]:
            target = f'(./@{ts})'  # using f-strings to populate the full path
            if path.xpath(target):
                # we start populating our current sublist with the relevant info
                RecordsGrid[Counter].append(path.xpath(target)[0])
            else:
                RecordsGrid[Counter].append('NA')

        for ts in tss[1:]:  
            target = f'(./ts[@name="{ts}"]/text())'
            if path.xpath(target):
                RecordsGrid[Counter].append(path.xpath(target)[0])
            else:
                RecordsGrid[Counter].append('NA')
        Counter += 1

    # now that we have our lists, create a df
    df = pd.DataFrame(RecordsGrid, columns=tss)
    df.to_csv(args.output, sep=',', encoding='utf-8', index=False)

Here is the result:

id,date,time,value
idOne,NA,NA,NA

Thanks for your time.

You forgot to include the code; please add it to the post.

– sushanth, Commented Jun 17, 2020 at 12:56
@Sushanth thanks, I updated the post

– Zebra125, Commented Jun 17, 2020 at 13:46

Artyom Vancyan · Accepted Answer · 2022-03-06 10:44:39Z

1

Try the following

from bs4 import BeautifulSoup as bs

data = list()

with open("data.xml") as xml:
    data_xml = bs(xml, "html.parser")
    for record in data_xml.find_all("record"):
        for ts in record.find_all("ts"):
            id_, date, time, value = record.get("id"), ts.get("date"), ts.get("time"), ts.text
            data.append(", ".join([id_, date, time, value]) + "\n")

with open("data.csv", "w") as csv:
    csv.write("ID, date, time, value\n")
    csv.writelines(data)
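
As a possible variant, the same nested loop can be sketched with only the standard library (xml.etree.ElementTree and csv) instead of BeautifulSoup. This is a minimal sketch, not a drop-in replacement: wrapping the fragments in a <root> element follows the approach from the question, the function name records_to_rows and the shortened sample data are made up for illustration, and an in-memory buffer stands in for the output file.

```python
import csv
import io
import itertools
import xml.etree.ElementTree as ET


def records_to_rows(lines):
    """Yield (id, date, time, value) for every <ts> inside every <record>.

    records_to_rows is a hypothetical helper name, not part of any library.
    """
    # The input has several top-level <record> elements, so wrap them in a
    # single root element to make the document well-formed before parsing.
    root = ET.fromstringlist(itertools.chain(["<root>"], lines, ["</root>"]))
    for record in root.iter("record"):
        for ts in record.iter("ts"):
            # date and time are attributes of <ts>; the value is its text.
            yield record.get("id"), ts.get("date"), ts.get("time"), ts.text


# Shortened sample in the shape of the question's data (assumed input).
sample = """<record id="idOne">
    <ts date="2019-07-03" time="15:28:41.720440">5</ts>
</record>
<record id="idTwo">
    <ts date="2019-07-03" time="15:28:45.072922">30</ts>
</record>
"""

rows = list(records_to_rows(sample.splitlines(keepends=True)))

out = io.StringIO()  # stands in for open("data.csv", "w", newline="")
writer = csv.writer(out)
writer.writerow(["ID", "date", "time", "value"])
writer.writerows(rows)
print(out.getvalue())
```

Note that csv.writer separates fields with a bare comma, so the output has no space after each comma, unlike the ", ".join version above.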

edited Mar 6, 2022 at 10:44

answered Jun 17, 2020 at 13:40

Artyom Vancyan


Greg · Accepted Answer · 2020-06-17 14:41:42Z

0

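To use lxml, you can simply pass the string to etree.HTML(). The HTML parser is lenient, so it accepts several top-level record elements without a wrapping root. With the XPath //record/ts (starting with a double forward slash), you can fetch all the ts elements. The parent's id can be read by calling .getparent() on each ts element and then reading its id attribute.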

To write the CSV, I would recommend Python's built-in csv module. You could use a plain file writer, but the csv writer handles quoting and delimiters for you, and it's cleaner.
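
To illustrate one of the issues the csv module takes care of, here is a small sketch comparing a plain join with csv.writer when a field itself contains a comma. The value "5, approx" is made up for the demonstration and does not occur in the question's data.

```python
import csv
import io

# A row whose last field contains a comma (hypothetical value).
row = ["idOne", "2019-07-03", "15:28:41.720440", "5, approx"]

# Plain string joining: the embedded comma is indistinguishable from a separator.
joined = ", ".join(row)

# csv.writer quotes the field that contains the delimiter.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
written = buffer.getvalue()

# Reading both lines back shows the difference.
fields_from_join = next(csv.reader([joined]))
fields_from_csv = next(csv.reader([written]))

print(len(fields_from_join))   # 5 fields: the value was split in two
print(fields_from_csv == row)  # True: the original 4 fields survive
```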

In general, you have one method that handles everything. I would recommend splitting the logic into functions; think Single Responsibility. In the solution below, I convert the XML nodes into a namedtuple and then write the namedtuples to CSV. That is a lot easier to maintain and read (i.e. there is one place that sets the header text and one place that populates the data).

from lxml import etree
import csv  # part of the standard library, no installation needed
from collections import namedtuple

Record = namedtuple('Record', ['id', 'date', 'time', 'value']) # Model to store records.

def CreateCsvFile(results):
    with open('results.csv', 'w', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=list(Record._fields)) # use the namedtuple fields for the headers 
        writer.writeheader()
        writer.writerows([x._asdict() for x in results]) # To use DictWriter, the namedtuple has to be converted to dictionary

def FormatRecord(xmlNode):
    return Record(xmlNode.getparent().attrib['id'], xmlNode.attrib["date"], xmlNode.attrib["time"], xmlNode.text)

def Main(html):
    xmlTree = etree.HTML(html)
    results = [FormatRecord(xmlNode) for xmlNode in xmlTree.xpath('//record/ts')] # the leading double slash matches record/ts at any depth in the tree.
    CreateCsvFile(results)

if __name__ == '__main__':
    Main("""<record id="idOne">
            <ts date="2019-07-03" time="15:28:41.720440">5</ts>
            <ts date="2019-07-03" time="15:28:42.629959">10</ts>
            <ts date="2019-07-03" time="15:28:43.552677">15</ts>
            <ts date="2019-07-03" time="15:28:43.855345">20</ts>
        </record>

        <record id="idTwo">
            <ts date="2019-07-03" time="15:28:45.072922">30</ts>
            <ts date="2019-07-03" time="15:28:45.377087">35</ts>
            <ts date="2019-07-03" time="15:28:46.316321">40</ts>
            <ts date="2019-07-03" time="15:28:47.527960">45</ts>
        </record>""")

answered Jun 17, 2020 at 14:41

Greg

3 Comments

Zebra125 Over a year ago

I have a question: can you explain this line? What does the function call before the for loop do? results = [FormatRecord(xmlNode) for xmlNode in xmlTree.xpath('//record/ts')]

Greg Over a year ago

This is a short way of writing a for loop (a list comprehension). xmlTree.xpath('//record/ts') returns a list of 8 ts elements. I've called each item xmlNode, as that describes what it contains (I probably should have gone for tsXmlNode). For each item I call FormatRecord(), passing in the xmlNode. FormatRecord() converts the xmlNode into a namedtuple called Record. The square brackets around the expression make Python iterate over the elements and collect the returned values in a list. As a result, the variable results contains a list of Record namedtuples.

Greg Over a year ago

You can always write the for loop the long way (it's easier to debug, but it's bad practice as it creates unnecessary code):

    results = []
    for xmlNode in xmlTree.xpath('//record/ts'):
        item = FormatRecord(xmlNode)
        results.append(item)
    CreateCsvFile(results)
