I downloaded an Excel XML file from the web and am trying to parse it. I have tried several approaches (xlrd, xml parse, ElementTree, and BeautifulSoup) and none of them work. Here is what the XML looks like:

<?xml version="1.0"?>
<ss:Workbook xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
<ss:Styles>
<ss:Style ss:ID="Default">
<ss:Alignment ss:Horizontal="Left"/>
</ss:Style>
<ss:Style ss:ID="wraptext">
<ss:Alignment ss:Horizontal="Left" ss:WrapText="1"/>
<ss:Font ss:Italic="1"/>
</ss:Style>
<ss:Style ss:ID="disclaimer">
<ss:Alignment ss:Vertical="Top" ss:WrapText="1"/>
<ss:Font ss:Italic="1"/>
</ss:Style>
<ss:Style ss:ID="DefaultHyperlink">
<ss:Alignment ss:Vertical="Center" ss:WrapText="1"/>
<ss:Font ss:Color="#0000FF" ss:Underline="Single" />
</ss:Style>
<ss:Style ss:ID="headerstyle">
<ss:Font ss:Bold="1" />
</ss:Style>
<ss:Style ss:ID="Date">
<ss:NumberFormat ss:Format="dd\-mmm\-yyyy"/>
</ss:Style>
<ss:Style ss:ID="Left">
<ss:Alignment ss:Horizontal="Left"/>
<ss:NumberFormat ss:Format="Standard"/>
</ss:Style>
<ss:Style ss:ID="Right">
<ss:Alignment ss:Horizontal="Right"/>
<ss:NumberFormat ss:Format="Standard"/>
</ss:Style>
</ss:Styles>
<ss:Worksheet ss:Name="Holdings">
<ss:Table>
<ss:Row>
<ss:Cell ss:StyleID="Left">
<ss:Data ss:Type="String">06-Oct-2020</ss:Data>
</ss:Cell>
</ss:Row>
<ss:Row>
<ss:Cell ss:StyleID="Left">
<ss:Data ss:Type="String">iShares Russell Top 200 Value ETF</ss:Data>
</ss:Cell>
</ss:Row>
.
.
.

Or you can download the full xml here

Ultimately I need to convert the file into a DataFrame, but for now I am open to any solution, such as converting to CSV first. Can anyone help?

  • show us the code you've tried Commented Oct 9, 2020 at 8:37
  • import xml.etree.ElementTree as et
    response = requests.get(url, headers=headers)
    parsed = et.parse(str(response.text))
    print(parsed.getroot())
    Commented Oct 9, 2020 at 8:41
  • and this
    soup = BeautifulSoup(str(response.text), 'xml')
    workbook = []
    for sheet in soup.findAll('Worksheet'):
        sheet_as_list = []
        for row in sheet.findAll('Row'):
            row_as_list = []
            for cell in row.findAll('Cell'):
                row_as_list.append(cell.Data.text)
            sheet_as_list.append(row_as_list)
        workbook.append(sheet_as_list)
    print(len(workbook))
    it prints 0 Commented Oct 9, 2020 at 8:46

2 Answers

Another method.

from simplified_scrapy import SimplifiedDoc, utils, req
xml = req.get(
    'https://www.ishares.com/us/products/239722/ishares-russell-top-200-value-etf/1521942788811.ajax?fileType=xls&fileName=iShares-Russell-Top-200-Value-ETF_fund&dataType=fund'
)
xml = xml.read().decode('utf-8')
doc = SimplifiedDoc(xml)
worksheets = doc.selects('ss:Worksheet') # Get all Worksheets
for worksheet in worksheets:
    rows = worksheet.selects('ss:Row').selects('ss:Cell>text()') # Get all rows
    utils.save2csv(worksheet['ss:Name'] + '.csv', rows) # Save data to csv

Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
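
If installing simplified_scrapy is not an option, roughly the same extraction can be sketched with the standard library. This is only a sketch under assumptions: SAMPLE is a shortened stand-in for the downloaded file (closed off so that it is well-formed), and worksheet_rows and rows_to_csv are hypothetical helper names. If the real download is not well-formed XML, ET.fromstring will raise a ParseError and the text would need cleaning first.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Namespace URI copied from the xmlns:ss declaration in the question's XML.
SS = "urn:schemas-microsoft-com:office:spreadsheet"
NS = {"ss": SS}

# Shortened stand-in for the downloaded file (assumption: the real file
# follows the same structure as the excerpt in the question).
SAMPLE = """<?xml version="1.0"?>
<ss:Workbook xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
<ss:Worksheet ss:Name="Holdings">
<ss:Table>
<ss:Row>
<ss:Cell ss:StyleID="Left">
<ss:Data ss:Type="String">06-Oct-2020</ss:Data>
</ss:Cell>
</ss:Row>
<ss:Row>
<ss:Cell ss:StyleID="Left">
<ss:Data ss:Type="String">iShares Russell Top 200 Value ETF</ss:Data>
</ss:Cell>
</ss:Row>
</ss:Table>
</ss:Worksheet>
</ss:Workbook>"""


def worksheet_rows(xml_text):
    """Return {worksheet name: rows}, where each row is a list of cell strings."""
    # fromstring() parses a string; ET.parse() expects a file name or file object.
    root = ET.fromstring(xml_text)
    sheets = {}
    for sheet in root.findall("ss:Worksheet", NS):
        # Attributes are namespaced too, so the key is "{uri}Name".
        name = sheet.get("{%s}Name" % SS)
        rows = []
        for row in sheet.iterfind(".//ss:Row", NS):
            cells = []
            for cell in row.findall("ss:Cell", NS):
                data = cell.find("ss:Data", NS)
                # Empty or missing <ss:Data> becomes an empty string.
                cells.append((data.text or "") if data is not None else "")
            rows.append(cells)
        sheets[name] = rows
    return sheets


def rows_to_csv(rows):
    """Serialize rows to CSV text in memory."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


sheets = worksheet_rows(SAMPLE)
csv_text = rows_to_csv(sheets["Holdings"])
print(csv_text)
```

The row lists returned here could also be passed to pandas.DataFrame directly, which may be a shorter route to the DataFrame the question asks for.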

1 Comment

Thanks. It works like a charm. At least I can get the worksheets as csv and can start from here.

Thanks for the answer above; it gave me some insight into a better solution.

It turned out that I could parse the response with BeautifulSoup once I decoded it with the correct encoding ('utf-8'). Also, BeautifulSoup could not find the tag name <ss:Worksheet>, but it could find the lowercase <ss:worksheet>.

This way I do not have to import another module.
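
The lowercase match is what one would expect if the document went through an HTML parser rather than an XML parser, since HTML parsers lowercase tag names. The sketch below illustrates that behavior with the standard library's html.parser.HTMLParser as a stand-in for BeautifulSoup (the answer does not say which parser was used). The sample bytes, the leading byte order mark, and the TagNameCollector class are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TagNameCollector(HTMLParser):
    """Hypothetical helper that records every start tag name it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes the tag name already converted to lowercase.
        self.tags.append(tag)


# Stand-in for the raw response body (assumption: UTF-8 bytes, here with a
# byte order mark in front to show that "utf-8-sig" strips it).
raw = (
    b"\xef\xbb\xbf"
    b'<?xml version="1.0"?>'
    b'<ss:Workbook xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">'
    b'<ss:Worksheet ss:Name="Holdings"></ss:Worksheet>'
    b"</ss:Workbook>"
)

# Decode explicitly instead of relying on a guessed encoding.
text = raw.decode("utf-8-sig")

collector = TagNameCollector()
collector.feed(text)
collector.close()
tag_names = collector.tags
print(tag_names)
```

With an HTML-style parser, a search for 'ss:Worksheet' therefore finds nothing while 'ss:worksheet' matches, which agrees with the observation above.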
