0

I would like to convert this XML file:

<record id="idOne">
    <ts date="2019-07-03" time="15:28:41.720440">5</ts>
    <ts date="2019-07-03" time="15:28:42.629959">10</ts>
    <ts date="2019-07-03" time="15:28:43.552677">15</ts>
    <ts date="2019-07-03" time="15:28:43.855345">20</ts>
</record>

<record id="idOne">
    <ts date="2019-07-03" time="15:28:45.072922">30</ts>
    <ts date="2019-07-03" time="15:28:45.377087">35</ts>
    <ts date="2019-07-03" time="15:28:46.316321">40</ts>
    <ts date="2019-07-03" time="15:28:47.527960">45</ts>
</record>

to this CSV file:

ID, date, time, value
idOne, 2019-07-03, 15:28:41.720440, 5
idOne, 2019-07-03, 15:28:42.629959, 10
idOne, 2019-07-03, 15:28:43.552677, 15
idOne, 2019-07-03, 15:28:43.855345, 20
idOne, 2019-07-03, 15:28:45.072922, 30
idOne, 2019-07-03, 15:28:45.377087, 35
idOne, 2019-07-03, 15:28:46.316321, 40
idOne, 2019-07-03, 15:28:47.527960, 45

I can have several bodies of ID structures.

I use the lxml library.

I tried with the xpath method and for loop but I can only get the ID but not the rest. The problem is the second for loop, but I don't know how to deal with the values of "date" and "time"...

with open(args.input, "r") as f:
    # add root balises to parse the xml file
    records = itertools.chain('<root>', f, '</root>')
    root = etree.fromstringlist(records)

    #root = etree.fromstring(records)
    # count the number of records
    NumberRecords = int(root.xpath('count(//record)'))

    RecordsGrid = [[] for __ in range(NumberRecords)]
    tss = ["id","date", "time", "value"]
    paths = root.xpath('//record')
    #print(paths)
    Counter = 0
    for path in paths:

        for ts in tss[:1]:
            target = f'(./@{ts})'  # using f-strings to populate the full path
            if path.xpath(target):
                # we start populating our current sublist with the relevant info
                RecordsGrid[Counter].append(path.xpath(target)[0])
            else:
                RecordsGrid[Counter].append('NA')

        for ts in tss[1:]:  
            target = f'(./ts[@name="{ts}"]/text())'
            if path.xpath(target):
                RecordsGrid[Counter].append(path.xpath(target)[0])
            else:
                RecordsGrid[Counter].append('NA')
        Counter += 1

    # now that we have our lists, create a df
    df = pd.DataFrame(RecordsGrid, columns=tss)
    df.to_csv(args.output, sep=',', encoding='utf-8', index=False)

Here the result:

id,date,time,value
idOne,NA,NA,NA

Thanks for your time.

2
  • You forgot to include the code, include that in the post. Commented Jun 17, 2020 at 12:56
  • @Sushanth thanks, I updated the post Commented Jun 17, 2020 at 13:46

2 Answers 2

1

Try the following

from bs4 import BeautifulSoup as bs

data = list()

with open("data.xml") as xml:
    data_xml = bs(xml, "html.parser")
    for record in data_xml.find_all("record"):
        for ts in record.find_all("ts"):
            id_, date, time, value = record.get("id"), ts.get("date"), ts.get("time"), ts.text
            data.append(", ".join([id_, date, time, value]) + "\n")

with open("data.csv", "w") as csv:
    csv.write("ID, date, time, value\n")
    csv.writelines(data)
Sign up to request clarification or add additional context in comments.

Comments

0

To use lxml, you can simply pass the string as html(). By using the xpath //record/ts (starting with double backslash), you can fetch all your ts results. The main id can be accessed by calling .getparent() and then the attribute.

To convert xml to csv, I would recommend using the python package csv. You can use normal file writer. However csv write handles a lot of issues and it's cleaner.

In general, you have one method that handles everything. I would recommend splitting the logic into functions. Think Single Responsibility. Also the solution below I've converted the xml nodes into a NamedTupple and then write the namedTupple to csv. It's a lot easier to maintain/ read. (i.e Theres one place that sets the header text and one place populate the data).

from lxml import etree
import csv #py -m pip install python-csv
import collections
from collections import namedtuple

Record = namedtuple('Record', ['id', 'date', 'time', 'value']) # Model to store records.

def CreateCsvFile(results):
    with open('results.csv', 'w', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=list(Record._fields)) # use the namedtuple fields for the headers 
        writer.writeheader()
        writer.writerows([x._asdict() for x in results]) # To use DictWriter, the namedtuple has to be converted to dictionary

def FormatRecord(xmlNode):
    return Record(xmlNode.getparent().attrib['id'], xmlNode.attrib["date"], xmlNode.attrib["time"], xmlNode.text)

def Main(html):
    xmlTree = etree.HTML(html)
    results = [FormatRecord(xmlNode) for xmlNode in xmlTree.xpath('//record/ts')] # the double backslash will retrieve all nodes for record.
    CreateCsvFile(results)

if __name__ == '__main__':
    Main("""<record id="idOne">
            <ts date="2019-07-03" time="15:28:41.720440">5</ts>
            <ts date="2019-07-03" time="15:28:42.629959">10</ts>
            <ts date="2019-07-03" time="15:28:43.552677">15</ts>
            <ts date="2019-07-03" time="15:28:43.855345">20</ts>
        </record>

        <record id="idTwo">
            <ts date="2019-07-03" time="15:28:45.072922">30</ts>
            <ts date="2019-07-03" time="15:28:45.377087">35</ts>
            <ts date="2019-07-03" time="15:28:46.316321">40</ts>
            <ts date="2019-07-03" time="15:28:47.527960">45</ts>
        </record>""")

3 Comments

I have a question, can you explain me the line, what do the function before the for loop please ? : results = [FormatRecord(xmlNode) for xmlNode in xmlTree.xpath('//record/ts')]
This is a short way or writing a for loop. xmlTree.xpath('//record/ts') will return a list of 8 ts items. I've called the item an xmlNode (as that describes what it contains. I probably should have gone for tsXmlNode). I then call the function FormatRecords(), which passes in the xmlNode. The FormatRecords() function will convert the xmlNode into a namedTuple called Record. The name tuple is then assigned to the variable results. The code is warped in square brackets which forced the code to iterate through.As a result, the variable results contains an array of namedtuple (called Record)
You can always write the for loop the long way (it's easier to debug, but it's bad practice as it create unnecessary code). results = [] for xmlNode in xmlTree.xpath('//record/ts'): item = FormatRecord(xmlNode) results.append(item) CreateCsvFile(results)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.