python xml and csv extract

Question

I have two files: (1) a CSV file (the main file) and (2) an XML file.

CSV file -

emp_id
1
2
3
4

XML File -

<employee>
    <emp id="1" />
    <emp id="2" active="yes">
        <tag k="age" v="55" />
    </emp>
    <emp id="3" active="yes">
        <tag k="name" v="scott" />
    </emp>
    <emp id="4" active="no">
        <tag k="address" v="Texas" />
    </emp>
    <emp id="5" gender="male"/>
    <emp id="8" />
    <emp id="9" />
    <emp id="10" />
    <emp id="11" />
</employee>

My objective is to match each emp_id from the CSV file against the XML file and write only the matched emp_id values to new CSV output. I need two CSV files like the ones below:

1st file.

emp_id,active,gender
1,,
2,yes,
3,yes,
4,no,
5,,male

2nd file.

emp_id,key,value
2,age,55
3,name,scott
4,address,Texas

I can read the CSV file in pandas and the XML file in Python, but I don't know how to combine them and extract the keys and values from the XML.

Any help is appreciated.

What is your current code? Please share it and explain what the problem is.
– balderman, Commented Oct 1, 2020 at 8:46
The current code only reads the CSV file and the XML file. I am stuck on the logic of combining the CSV data with the XML file and extracting the information.
– user3470294, Commented Oct 1, 2020 at 8:53

dabingsou · Accepted Answer · 2020-10-03 03:02:47Z

Another method.

from simplified_scrapy import SimplifiedDoc, utils
empIds = ['1', '2', '3', '4', '5']
# empIds = [id.strip() for id in utils.getFileLines('your csv file path')[1:]]

xml = '''<employee>
    <emp id="1" />
    <emp id="2" active="yes">
        <tag k="age" v="55" />
    </emp>
    <emp id="3" active="yes">
        <tag k="name" v="scott" />
    </emp>
    <emp id="4" active="no">
        <tag k="address" v="Texas" />
    </emp>
    <emp id="5" gender="male"/>
    <emp id="8" />
    <emp id="9" />
    <emp id="10" />
    <emp id="11" />
</employee>
'''
# xml = utils.getFileContent('your xml file path')

rows1 = [['emp_id', 'active', 'gender']]
rows2 = [['emp_id', 'key', 'value']]
doc = SimplifiedDoc(xml)
for id in empIds:
    emp = doc.select('emp#' + id)
    if emp:
        rows1.append([id, emp.get('active'), emp.get('gender')])
        tags = emp.selects('tag')
        if tags:
            for tag in tags:
                rows2.append([id, tag['k'], tag['v']])

utils.save2csv('csv1.csv', rows1, newline='')
utils.save2csv('csv2.csv', rows2, newline='')

Result csv1:

emp_id,active,gender
1,,
2,yes,
3,yes,
4,no,
5,,male

Result csv2:

emp_id,key,value
2,age,55
3,name,scott
4,address,Texas
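
The sample above hard-codes empIds. As a minimal sketch, assuming the CSV has a single header column named emp_id as in the question, the ids could also be loaded with the standard csv module. The helper name read_emp_ids is hypothetical, and the in-memory StringIO stands in for a real file.

```python
import csv
import io

def read_emp_ids(csv_file):
    """Return the emp_id column as a list of stripped strings, skipping blanks."""
    reader = csv.DictReader(csv_file)
    return [row["emp_id"].strip() for row in reader
            if (row.get("emp_id") or "").strip()]

# In-memory stand-in for the question's CSV; with a real file, pass
# open("your csv file path", newline="") instead.
sample = io.StringIO("emp_id\n1\n2\n3\n4\n")
empIds = read_emp_ids(sample)
print(empIds)
```

The ids are kept as strings because XML attribute values are strings too, so they can be compared without conversion.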

Andre S. · Accepted Answer · 2020-10-01 09:19:25Z

First df: Read XML and create DataFrame

import pandas as pd
import xmltodict

with open('XmlFile.xml') as fd:
    xmlfile = xmltodict.parse(fd.read())
df_xmlfile = pd.DataFrame(xmlfile["employee"]["emp"])
# xmltodict prefixes attribute names with "@"; strip it from the column names
df_xmlfile.columns = [col.replace("@", "") for col in df_xmlfile.columns]

Read CSV as DataFrame

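df_csvfile = pd.read_csv("CsvFile.txt")

Join both DataFrames

# Match on emp_id rather than on row position; XML ids are parsed as strings
df_first = df_csvfile.merge(df_xmlfile.astype({"id": int}).rename(columns={"id": "emp_id"})[["emp_id", "active", "gender"]], on="emp_id")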

Second df: Get rows containing a tag, unpack those rows to get the keys and values and create second df

df_temp = df_xmlfile[~df_xmlfile["tag"].isna()][["id", "tag"]]
df_second = pd.DataFrame({"emp_id": df_temp["id"],
              "key": [row["@k"] for row in df_temp["tag"]],
              "value": [row["@v"] for row in df_temp["tag"]]})

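The unpacking above assumes each emp has at most one tag. xmltodict returns a single dict for one child element and a list of dicts when the element repeats, so with several tags per emp the row["@k"] lookup would fail. A minimal sketch of one way to normalise this before building the DataFrame follows. The helper as_list is hypothetical, and the emps list is a hand-written stand-in for the parser output (the second tag on emp 3 is made up for illustration).

```python
def as_list(value):
    """Normalise an xmltodict child value to a list of dicts."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]

# Hand-written stand-ins for what xmltodict.parse would return per <emp>
emps = [
    {"@id": "1"},
    {"@id": "2", "@active": "yes", "tag": {"@k": "age", "@v": "55"}},
    {"@id": "3", "tag": [{"@k": "name", "@v": "scott"},
                         {"@k": "age", "@v": "40"}]},  # illustrative extra tag
]

# One output row per (emp, tag) pair, in long format
rows = [
    {"emp_id": emp["@id"], "key": tag["@k"], "value": tag["@v"]}
    for emp in emps
    for tag in as_list(emp.get("tag"))
]
print(rows)
```

The resulting list of dicts can be passed straight to pd.DataFrame(rows) to obtain the emp_id, key, value layout.
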
Alexandra Dudkina · Accepted Answer · 2020-10-01 09:10:08Z


Let's create the first DataFrame:

import pandas as pd

# this DataFrame should be read from the CSV file in your case
df = pd.DataFrame({
    'emp_id': [1, 2, 3, 4, 5]
})

Then define the XML:

xml = '''<employee>
    <emp id="1" />
    <emp id="2" active="yes">
        <tag k="age" v="55" />
    </emp>
    <emp id="3" active="yes">
        <tag k="name" v="scott" />
    </emp>
    <emp id="4" active="no">
        <tag k="address" v="Texas" />
    </emp>
    <emp id="5" gender="male"/>
    <emp id="8" />
    <emp id="9" />
    <emp id="10" />
    <emp id="11" />
</employee>'''

Read data from XML:

from collections import defaultdict
from lxml import etree as et  # .xpath() below requires lxml, not the stdlib ElementTree

d = defaultdict(list)
# here we read the XML from a string; in your case it should be read from a file
root = et.fromstring(xml)
emps = root.xpath('//employee/emp')
# iterate over the "emp" elements and extract data
for emp in emps:
    # extract the attributes id, active and gender (get() returns None if absent)
    d['id'].append(int(emp.get('id')))
    d['active'].append(emp.get('active'))
    d['gender'].append(emp.get('gender'))
    # extract age
    age = emp.find('./tag[@k="age"]')
    d['age'].append(None if age is None else age.get('v'))
    # extract address
    address = emp.find('./tag[@k="address"]')
    d['address'].append(None if address is None else address.get('v'))

Then create a DataFrame from the dict:

df_xml = pd.DataFrame(d)

Then merge the data using the emp_id / id columns:

df_merged = pd.merge(df, df_xml, left_on = 'emp_id', right_on='id', how = 'inner')
del df_merged['id']

df_merged.head(10)


2 Comments

user3470294 Over a year ago

Thanks, but there are multiple key/value pairs; I have shown just a sample. I don't want the output to have many columns.

Alexandra Dudkina Over a year ago

In that case, keep only the first line inside the loop for emp in emps: d['id'].append(int(emp.get('id'))). That will create a DataFrame with only one column.
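
For the case raised in the comments, where each emp may carry many different keys, a long-format layout (one row per tag) keeps the column count fixed. The following is a minimal sketch using only the standard library, not the method of any answer above. The function names extract and to_csv_text are hypothetical, and the second tag on emp 4 is made up to show the multi-key case.

```python
import csv
import io
import xml.etree.ElementTree as ET

def extract(xml_text, emp_ids):
    """Build the two row lists for the emp ids that appear in emp_ids."""
    wanted = {str(i) for i in emp_ids}
    attr_rows = [["emp_id", "active", "gender"]]
    tag_rows = [["emp_id", "key", "value"]]
    for emp in ET.fromstring(xml_text).iter("emp"):
        emp_id = emp.get("id")
        if emp_id not in wanted:
            continue  # id is in the XML but not in the CSV
        attr_rows.append([emp_id, emp.get("active", ""), emp.get("gender", "")])
        for tag in emp.findall("tag"):  # zero, one or many tags
            tag_rows.append([emp_id, tag.get("k"), tag.get("v")])
    return attr_rows, tag_rows

def to_csv_text(rows):
    """Serialise rows to CSV text; swap the buffer for a file to save."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()

xml_text = """<employee>
    <emp id="1" />
    <emp id="2" active="yes"><tag k="age" v="55" /></emp>
    <emp id="4" active="no"><tag k="address" v="Texas" /><tag k="age" v="30" /></emp>
    <emp id="8" />
</employee>"""

attr_rows, tag_rows = extract(xml_text, [1, 2, 3, 4])
print(to_csv_text(attr_rows))
print(to_csv_text(tag_rows))
```

Rows come out in XML order rather than CSV order, and ids present in the CSV but missing from the XML (3 in this sample) are simply skipped.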
