I have been trying to parse an XML file and copy the content of only two tags, Code and Source. The XML file looks as follows:

<Series xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RunDate>2018-06-12</RunDate>
  <Instruments>
    <Instrument>
      <Code>27BA1</Code>
      <Source>YYY</Source>
    </Instrument>
    <Instrument>
      <Code>28BA1</Code>
      <Source>XXX</Source>
    </Instrument>
      <Code>29BA1</Code>
      <Source>XXX</Source>
    </Instrument>
      <Code>30BA1</Code>
      <Source>DDD</Source>
    </Instrument>
  </Instruments>
</Series>

My code only extracts the first Code and Source pair. The code is below. Can anyone help?

import xml.etree.ElementTree as ET
import csv

tree = ET.parse("data.xml")
csv_fname = "data.csv"
root = tree.getroot()

f = open(csv_fname, 'w')
csvwriter = csv.writer(f)
count = 0
head = ['Code', 'Source']

csvwriter.writerow(head)

for time in root.findall('Instruments'):
    row = []
    job_name = time.find('Instrument').find('Code').text
    row.append(job_name)
    job_name_1 = time.find('Instrument').find('Source').text
    row.append(job_name_1)
    csvwriter.writerow(row)
f.close()

2 Answers

The XML in your post is not well-formed. You can confirm this by pasting it into a validator such as https://www.w3schools.com/xml/xml_validator.asp
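The same check can be done locally. The sketch below assumes Python's standard xml.etree.ElementTree and uses a shortened copy of the posted XML; the helper name check_well_formed is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML from the question: the second <Code>/<Source>
# pair has a closing </Instrument> but no opening <Instrument>.
BROKEN_XML = """<Series>
  <Instruments>
    <Instrument>
      <Code>27BA1</Code>
      <Source>YYY</Source>
    </Instrument>
      <Code>29BA1</Code>
      <Source>XXX</Source>
    </Instrument>
  </Instruments>
</Series>"""


def check_well_formed(xml_text):
    """Return None if xml_text parses, else the parser's error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return str(exc)
    return None


# Prints the parser's complaint, including the line and column where it stopped.
print(check_well_formed(BROKEN_XML))
```

The reported line number points at the stray closing tag, which is a quick way to find where an opening tag is missing.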

I assume the intended XML is:

<Series xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RunDate>2018-06-12</RunDate>
  <Instruments>
    <Instrument>
      <Code>27BA1</Code>
      <Source>YYY</Source>
    </Instrument>
    <Instrument>
      <Code>28BA1</Code>
      <Source>XXX</Source>
    </Instrument>
    <Instrument>
      <Code>29BA1</Code>
      <Source>XXX</Source>
    </Instrument>
    <Instrument>
      <Code>30BA1</Code>
      <Source>DDD</Source>
    </Instrument>
  </Instruments>
</Series>

To print the values of the Code and Source tags:

from lxml import etree

root = etree.parse('data.xml').getroot()
instruments = root.find('Instruments')
# Visit every <Instrument> child, not just the first one.
for instrument in instruments.findall('Instrument'):
    code, source = instrument.find('Code'), instrument.find('Source')
    print(code.text, source.text)
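
As for why the code in the question yields a single row: root.findall('Instruments') matches only one element, and find('Instrument') always returns the first child. A minimal sketch of a fix that stays with the standard library and still writes CSV follows. It assumes the corrected XML (shortened here to two entries), and the helper name instruments_to_rows and the in-memory buffer are illustrative choices, not part of the original code.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Shortened copy of the corrected XML.
XML_TEXT = """<Series xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RunDate>2018-06-12</RunDate>
  <Instruments>
    <Instrument>
      <Code>27BA1</Code>
      <Source>YYY</Source>
    </Instrument>
    <Instrument>
      <Code>28BA1</Code>
      <Source>XXX</Source>
    </Instrument>
  </Instruments>
</Series>"""


def instruments_to_rows(root):
    """Collect one [Code, Source] row per <Instrument> element."""
    rows = []
    # findall with a path visits every <Instrument>, not just the first.
    for instrument in root.findall('Instruments/Instrument'):
        rows.append([instrument.findtext('Code'), instrument.findtext('Source')])
    return rows


root = ET.fromstring(XML_TEXT)
rows = instruments_to_rows(root)

# An in-memory buffer stands in for open("data.csv", "w", newline="").
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['Code', 'Source'])
writer.writerows(rows)
print(buffer.getvalue())
```

With a real file, ET.parse("data.xml").getroot() would replace ET.fromstring, and the buffer would be replaced by a file opened with newline="" so the csv module controls line endings.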

If you are able to run XSLT against your document - I assume you can - an alternative approach makes this very straightforward:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:msxsl="urn:schemas-microsoft-com:xslt" exclude-result-prefixes="msxsl"
>
  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:text>Code,Source</xsl:text><xsl:text>&#xa;</xsl:text>
    <xsl:apply-templates select="//Instrument"/>
  </xsl:template>
  <xsl:template match="Instrument">
<xsl:value-of select="Code"/>,<xsl:value-of select="Source"/><xsl:text>&#xa;</xsl:text>
</xsl:template>
</xsl:stylesheet>

Note the presence of the <xsl:text>&#xa;</xsl:text> element - this is to insert the line breaks which are semantically important in CSV, but not in XML.

Output:

Code,Source
27BA1,YYY
28BA1,XXX
29BA1,XXX
30BA1,DDD

To run this in Python I guess you'd need something like the approach suggested in this question:

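import lxml.etree as ET

xml_filename = "data.xml"       # input document
xsl_filename = "to_csv.xsl"     # hypothetical file name for the stylesheet above

dom = ET.parse(xml_filename)
xslt = ET.parse(xsl_filename)
transform = ET.XSLT(xslt)
result = transform(dom)
# With <xsl:output method="text"/> the result is plain text, so str() is used.
print(str(result))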

I don't use Python, so I have no idea whether this is correct or not.

Whoops - I also neglected to mention that your XML document is not well-formed - the opening <Instrument> tags are missing after lines 11 and 14 of the XML (before the 29BA1 and 30BA1 entries). Adding these where they belong makes the document transform correctly.

3 Comments

Hi. I have no idea how to do that. Any guidance would be appreciated. Thanks
You haven't specified anything about the language or environment you're using. I don't recognise the language in your question, so by extension I also don't know what you're using to execute it. The best way to run a stylesheet against a document depends on your tools. Could you please specify them in the question? Thanks.
I don't think this is what I'm looking for. I'm just looking for someone to please look at my Python code and tell me what I'm doing wrong. Not looking to use XSLT. Thanks for your help.
