XML to csv conversion using python xml library

Question

Sorry if this question is already asked. I have below XML file for which i want to convert it into CSV or excel. Here i want to extract NodeName and its child DestIPAddress under IpRoutelist. and Value under Custom/Name tag

<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
    <soap:Body>
        <EnbConfigGetResponse xmlns="http://Airspan.Netspan.WebServices">
            <EnbConfigGetResult>
                <ErrorCode>OK</ErrorCode>
                <NodeResult>
                    <NodeResultCode>OK</NodeResultCode>
                    <NodeName>IMUMB0899</NodeName>
                    <NodeDetail>
                        <Custom>
                            <Name>Circle</Name>
                            <Value>MU</Value>
                        </Custom>
                        <Custom>
                            <Name>GW VLAN 601</Name>
                            <Value>2405:200:101::</Value>
                        </Custom>
                        <Custom>
                            <Name>GW VLAN 602</Name>
                            <Value>2405:200:104::</Value>
                        </Custom>
                    </NodeDetail>
                    <EnbConfig>
                        <Name>IMUMB0899</Name>
                        <Hardware>1000 SFP</Hardware>
                        <Description>TT</Description>
                        <Site>DND</Site>
                        <Region>DND</Region>
                        <Altitude>0</Altitude>
                        <NbifEventAlarmForwarding>Enabled</NbifEventAlarmForwarding>
                        <ENodeBType>Macro</ENodeBType>
                        <ENodeBID>397063</ENodeBID>
                        <M1SubnetMask>120</M1SubnetMask>
                        <IpRouteList>
                            <IpRoute>
                                <DestIpAddress>172.172.6.20</DestIpAddress>
                                <IpSubnetMask>255.255.255.255</IpSubnetMask>
                                <GatewayIpAddress>172.21.200.1</GatewayIpAddress>
                            </IpRoute>
                            <IpRoute>
                                <DestIpAddress>2405:20:1::</DestIpAddress>
                                <IpSubnetMask>40</IpSubnetMask>
                                <GatewayIpAddress>2405:20:101:4:7:2:61:1</GatewayIpAddress>
                            </IpRoute>
                        </IpRouteList>
                <NodeResult>
     </EnbConfigGetResult>
        </EnbConfigGetResponse>
    </soap:Body>
</soap:Envelope>

I tried below code which extract the Name and IProute, but when i try to merge tham only one IP route i get against NodeName but two are available.

from bs4 import BeautifulSoup
import pandas as pd
import lxml
import xml.etree.cElementTree
import openpyxl
import inspect
import os

sites = "xml"

with open(sites, "r",encoding='unicode_escape') as f:
    xml_data = f.read()

soup = BeautifulSoup(xml_data, "xml")

tag1 = input("Enter tagname1:")
tag2 = input("Enter tagname2:")

data = []
dd = []

for td in soup.find_all(tag1):
    data.append({"NodeName": td.text})
    
for iproute in soup.find_all(tag2):
    dd.append({"IpRoute": iproute.text})

df1 = pd.DataFrame(data)
df2 = pd.DataFrame(dd)
    
df = pd.merge(df1,df2,left_index=True, right_index=True)
    
df.to_excel(sites + '.xlsx', sheet_name='Detail', index = False)
print("*************Done*************")

Expected Output:

What do you enter for tag1 and tag2? Also, why are you storing your results in two separate data structures (data and dd), then merging them later? Don't you want to keep everything together? — MattDMo
– MattDMo, Commented Sep 18, 2020 at 18:49
Your xml is invalid; can you edit your question and correct it? — Jack Fleeting
– Jack Fleeting, Commented Sep 18, 2020 at 18:50
@MattDMo i need the output as shown in snapshot. And i enter tag1=NodeName and tag2 = DestIpAddress — Sonu
– Sonu, Commented Sep 18, 2020 at 18:54

dabingsou · Accepted Answer · 2020-10-08 02:37:57Z

4

Another method.

from simplified_scrapy import SimplifiedDoc, utils, req
# xml = utils.getFileContent('file.xml')
xml = ''' Your xml string'''

doc = SimplifiedDoc(xml)
lstNodeResult = doc.selects('NodeResult')
data = [['NodeName','DestIpAddress','GatewayIpAddress','value1','value2','value3']]
for result in lstNodeResult:
    lstCustom = result.selects('NodeDetail>Custom')
    if lstCustom:
        lstCustom = lstCustom.Value.text
    NodeName = result.NodeName.text
    lstIpRoute = result.IpRoutes
    for IpRoute in lstIpRoute:
        row = [NodeName,IpRoute.DestIpAddress.text,IpRoute.GatewayIpAddress.text]
        if lstCustom: row.extend(lstCustom)
        data.append(row)
# print (data)
utils.save2csv('test.csv',data)

# Or
data = {
    'NodeName':lstNodeResult.NodeName.text,
    'DestIpAddress':lstNodeResult.select('IpRoute>DestIpAddress>text()'),
    'GatewayIpAddress':lstNodeResult.select('IpRoute>GatewayIpAddress>text()')
}
# print (data)

Result:

.......
I-MU-NVMB-OSC-0900-SMC004,2405:200:310:5a::,2405:200:101:500:7:2:602:3503
I-MU-NVMB-OSC-0900-SMC004,2405:200:310:1::,2405:200:101:500:7:2:602:3503
I-MU-NVMB-ISC-0181-SWC0002,2405:200:310:1::,2405:200:101:500:7:2:602:5d03,MU,2405:200:101:500:7:2:601:5d03,2405:200:101:500:7:2:602:5d03
I-MU-NVMB-ISC-0181-SWC0002,2405:200:311:2::,2405:200:101:500:7:2:602:5d03,MU,2405:200:101:500:7:2:601:5d03,2405:200:101:500:7:2:602:5d03
I-MU-NVMB-ISC-0181-SWC0002,2405:200:310:a152::,2405:200:101:500:7:2:602:5d03,MU,2405:200:101:500:7:2:601:5d03,2405:200:101:500:7:2:602:5d03
.......

Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples

Delete blank lines.

def delEmptyRows(name, encoding="utf-8"):
    lines = utils.getFileLines(name, encoding=encoding)
    lines = [line for line in lines if line.strip()!='']
    utils.saveFile(name, "".join(lines), encoding=encoding)
# use
delEmptyRows('test.csv')

Method of obtaining IBridge2RfStatsRow data.

from simplified_scrapy import SimplifiedDoc, utils, req
xml = req.get('https://pastebin.com/raw/SWhDM1zq')

rows = []
doc = SimplifiedDoc(xml)
lstIBridge2RfStatsRow = doc.selects('IBridge2RfStatsRow').children
# Add header
header = []
for col in lstIBridge2RfStatsRow[0]:
    header.append(col.tag)
rows.append(header)

# Generate line
for row in lstIBridge2RfStatsRow:
    rows.append([col.text for col in row])

utils.save2csv('IBridge2RfStatsRow.csv',rows)

edited Oct 8, 2020 at 2:37

answered Sep 19, 2020 at 23:51

dabingsou

2,4691 gold badge7 silver badges8 bronze badges

Sign up to request clarification or add additional context in comments.

16 Comments

Sonu Over a year ago

Hey @dabingsou Thanks for your response, but there are multiple destIP against each NodeName., how to make that work?

dabingsou Over a year ago

@SunilSharma The number of these destIP is not fixed. What do you want to do with it?

dabingsou Over a year ago

@SunilSharma This problem does exist in windows. I haven't found a good way. Temporarily added a method to delete empty lines.

dabingsou Over a year ago

@SunilSharma I have updated the answer and you can see if it is what you want.

dabingsou Over a year ago

@SunilSharma name = doc.select('StatisticsResult>Name>text()')

|

Sowjanya R Bhat · Accepted Answer · 2020-09-18 20:05:06Z

1

There are certain things you have missed to consider in your code - like you are just doing a soup.find_all('NodeName') - which gives all nodenames present in the xml and soup.find_all('DestIPAddress') gives all destipaddresses present in the xml - how will you map them if you do it that way ?

Try this code, I have made outer loop to go through each noderesult - and inner loop to go through each destipaddress , i am assuming that nodename will be single for each noderesult hence using that same name for every destipaddress found in a particular noderesult :

tag1 = "NodeName"
tag2 = "DestIpAddress"

tag1_list = []
tag2_list = []

for node_res in soup.find_all('NodeResult'):
    
  for tg2 in node_res.find_all(tag2):
      tag1_list.append(soup.find(tag1).text)
      tag2_list.append(tg2.text)

print(tag1_list)
print(tag2_list)

df = pd.DataFrame({
  'NodeNames': tag1_list,
  'IpRoute': tag2_list
})
    
print(df)

DF-output :

   NodeNames       IpRoute
0  IMUMB0899  172.172.6.20
1  IMUMB0899   2405:20:1::

You can add the write-excel line in the end .

edited Sep 18, 2020 at 20:05

answered Sep 18, 2020 at 19:55

Sowjanya R Bhat

1,18610 silver badges21 bronze badges

7 Comments

Sonu Over a year ago

Thanks it worked well, i missed one thing i also need GatewayIpAddress against each nodename. And Nodename is single for each Noderesult

Sowjanya R Bhat Over a year ago

hope you got an idea how to add new items from xml to df

Sonu Over a year ago

Thanks, i got the idea i worked on pandas, but xml is bit new for me

Sonu Over a year ago

The NodeName in output is coming same for all xml, its taking first NodeName for full xml.

Sowjanya R Bhat Over a year ago

there's a mistake in my answer - tag1_list.append(soup.find(tag1).text) should be changed to tag1_list.append(node_res.find(tag1).text) - then you'll get the proper node-name .

|

Jack Fleeting · Accepted Answer · 2020-09-18 23:11:57Z

1

Since you are dealing with xml and importing lxml, you might as well use it with xpath.

Note that the xml in your question is invalid (even after the editing) and this answer assumes the example was properly edited to be valid.

from lxml import etree
import pandas as pd

soap = """[your xml above, fixed]"""
doc = etree.XML(soap.encode())
columns = ['NodeName','DesIPAdd','GIPAdd','Value1','Value2','Value3']
ns = {'xx' : "http://Airspan.Netspan.WebServices"}
nodes = doc.xpath('//xx:NodeResult', namespaces=ns)
rows = []
for node in nodes:        
    sub_nodes = node.xpath('//xx:EnbConfig/xx:IpRouteList/xx:IpRoute', namespaces=ns)    
    for sub_node in sub_nodes:
        row = []
        row.append(node.xpath('//xx:NodeName/text()', namespaces=ns)[0])        
        row.append(sub_node.xpath('.//xx:DestIpAddress/text()', namespaces=ns)[0])
        row.append(sub_node.xpath('.//xx:GatewayIpAddress/text()', namespaces=ns)[0])
        row.extend(node.xpath('//xx:NodeDetail/xx:Custom/xx:Value/text()', namespaces=ns))
        rows.append(row)
pd.DataFrame(rows,columns=columns)

Output:

      NodeName    DesIPAdd      GIPAdd              Value1  Value2      Value3
0   IMUMB0899   172.172.6.20    172.21.200.1           MU   2405:200:101::  2405:200:104::
1   IMUMB0899   2405:20:1::     2405:20:101:4:7:2:61:1  MU  2405:200:101::  2405:200:104::

answered Sep 18, 2020 at 23:11

Jack Fleeting

25k6 gold badges27 silver badges49 bronze badges

3 Comments

Sonu Over a year ago

I tried this with my xml file, instead of 997 rows, i get 60894 rows, so something happen in the loop,and NodeName is same for all rows. i have loaded part of xml on below 0bin.net/paste/HOf-mVMg#bLRgpGfbezBNX8wrO-MW/…

Jack Fleeting Over a year ago

@SunilSharma I took a quick look at the xml; if I were you, I would give up on the idea of converting it to pandas (or whatever). I would stick with the native xml environment, learn xpath and xquery, use tools like BaseX and go from there.

Sonu Over a year ago

Yeah but this the requirement, thats why need to convert it into csv.

Collectives™ on Stack Overflow

XML to csv conversion using python xml library

3 Answers 3

16 Comments

7 Comments

3 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

16 Comments

7 Comments

3 Comments

Your Answer

Sign up or log in

Post as a guest

Related