Parse old Excel XML in Python

Question

I downloaded an Excel XML file from the web and am trying to parse it. I tried many approaches, including xlrd, XML parsing, ElementTree, and BeautifulSoup, and none of them work. Here is what the XML looks like:

<?xml version="1.0"?>
<ss:Workbook xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
<ss:Styles>
<ss:Style ss:ID="Default">
<ss:Alignment ss:Horizontal="Left"/>
</ss:Style>
<ss:Style ss:ID="wraptext">
<ss:Alignment ss:Horizontal="Left" ss:WrapText="1"/>
<ss:Font ss:Italic="1"/>
</ss:Style>
<ss:Style ss:ID="disclaimer">
<ss:Alignment ss:Vertical="Top" ss:WrapText="1"/>
<ss:Font ss:Italic="1"/>
</ss:Style>
<ss:Style ss:ID="DefaultHyperlink">
<ss:Alignment ss:Vertical="Center" ss:WrapText="1"/>
<ss:Font ss:Color="#0000FF" ss:Underline="Single" />
</ss:Style>
<ss:Style ss:ID="headerstyle">
<ss:Font ss:Bold="1" />
</ss:Style>
<ss:Style ss:ID="Date">
<ss:NumberFormat ss:Format="dd\-mmm\-yyyy"/>
</ss:Style>
<ss:Style ss:ID="Left">
<ss:Alignment ss:Horizontal="Left"/>
<ss:NumberFormat ss:Format="Standard"/>
</ss:Style>
<ss:Style ss:ID="Right">
<ss:Alignment ss:Horizontal="Right"/>
<ss:NumberFormat ss:Format="Standard"/>
</ss:Style>
</ss:Styles>
<ss:Worksheet ss:Name="Holdings">
<ss:Table>
<ss:Row>
<ss:Cell ss:StyleID="Left">
<ss:Data ss:Type="String">06-Oct-2020</ss:Data>
</ss:Cell>
</ss:Row>
<ss:Row>
<ss:Cell ss:StyleID="Left">
<ss:Data ss:Type="String">iShares Russell Top 200 Value ETF</ss:Data>
</ss:Cell>
</ss:Row>
.
.
.

Or you can download the full xml here

Ultimately I need to convert the file into a DataFrame, but for now I am open to any solution, perhaps converting to CSV first. Can anyone help?

import xml.etree.ElementTree as et
response = requests.get(url, headers=headers)
parsed = et.parse(str(response.text))
print(parsed.getroot())
– Hayton Leung, Commented Oct 9, 2020 at 8:41
and this:
soup = BeautifulSoup(str(response.text), 'xml')
workbook = []
for sheet in soup.findAll('Worksheet'):
    sheet_as_list = []
    for row in sheet.findAll('Row'):
        row_as_list = []
        for cell in row.findAll('Cell'):
            row_as_list.append(cell.Data.text)
        sheet_as_list.append(row_as_list)
    workbook.append(sheet_as_list)
print(len(workbook))
It prints 0.
– Hayton Leung, Commented Oct 9, 2020 at 8:46

dabingsou · Accepted Answer · 2020-10-09 09:34:40Z

4

Another method.

from simplified_scrapy import SimplifiedDoc, utils, req
xml = req.get(
    'https://www.ishares.com/us/products/239722/ishares-russell-top-200-value-etf/1521942788811.ajax?fileType=xls&fileName=iShares-Russell-Top-200-Value-ETF_fund&dataType=fund'
)
xml = xml.read().decode('utf-8')
doc = SimplifiedDoc(xml)
worksheets = doc.selects('ss:Worksheet') # Get all Worksheets
for worksheet in worksheets:
    rows = worksheet.selects('ss:Row').selects('ss:Cell>text()') # Get all rows
    utils.save2csv(worksheet['ss:Name'] + '.csv', rows) # Save data to csv

Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples

answered Oct 9, 2020 at 9:34


1 Comment

Hayton Leung Over a year ago

Thanks. It works like a charm. At least I can get the worksheets as csv and can start from here.

Hayton Leung · Accepted Answer · 2020-10-12 03:54:04Z

1

Thanks for the answer above; it gave me some insight into a better solution.

It turned out I was able to parse the response with BeautifulSoup once I used the correct decoding scheme ('utf-8'). The other catch is that BeautifulSoup was not able to pick up the tag name <ss:Worksheet>, but it was able to pick up <ss:worksheet>, likely because an HTML parser lowercases tag names.

This way I do not have to import another module.
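For comparison, here is a minimal standard-library sketch of the same idea using xml.etree.ElementTree instead of BeautifulSoup. It is not the code from this answer. It assumes the response has already been decoded to a string, that the namespace URI is the one shown in the question's sample, and that every worksheet keeps its rows under ss:Table. The function name workbook_to_rows and the shortened SAMPLE string are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the xmlns:ss declaration in the question's sample.
NS = {"ss": "urn:schemas-microsoft-com:office:spreadsheet"}


def workbook_to_rows(xml_text):
    """Return {worksheet name: rows}, where each row is a list of cell strings."""
    # Assumption: drop a leading byte-order mark if the decoded text has one.
    xml_text = xml_text.lstrip("\ufeff")
    # fromstring() parses a string; ET.parse() expects a file name or file object.
    root = ET.fromstring(xml_text)
    # Namespaced attributes are stored under "{uri}local-name" keys.
    name_attr = "{%s}Name" % NS["ss"]
    sheets = {}
    for sheet in root.findall("ss:Worksheet", NS):
        rows = []
        for row in sheet.findall("ss:Table/ss:Row", NS):
            cells = []
            for cell in row.findall("ss:Cell", NS):
                data = cell.find("ss:Data", NS)
                # Empty or missing <ss:Data> becomes an empty string.
                cells.append((data.text or "") if data is not None else "")
            rows.append(cells)
        sheets[sheet.get(name_attr)] = rows
    return sheets


# Hypothetical, shortened sample modelled on the file in the question.
SAMPLE = """<?xml version="1.0"?>
<ss:Workbook xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
<ss:Worksheet ss:Name="Holdings">
<ss:Table>
<ss:Row><ss:Cell ss:StyleID="Left"><ss:Data ss:Type="String">06-Oct-2020</ss:Data></ss:Cell></ss:Row>
<ss:Row><ss:Cell ss:StyleID="Left"><ss:Data ss:Type="String">iShares Russell Top 200 Value ETF</ss:Data></ss:Cell></ss:Row>
</ss:Table>
</ss:Worksheet>
</ss:Workbook>"""

sheets = workbook_to_rows(SAMPLE)
print(sheets["Holdings"])
```

Passing the namespace map to findall() is what lets the ss: prefix resolve; searching for a bare 'Worksheet' would match nothing, because every element in this file lives in the spreadsheet namespace. The resulting lists of rows can then be written to CSV or handed to a DataFrame constructor once the header row has been identified.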

answered Oct 12, 2020 at 3:54
