1

I have two files: (1) a CSV file (the main file) and (2) an XML file.

CSV file -

emp_id
1
2
3
4

XML File -

<employee>
    <emp id="1" />
    <emp id="2" active="yes">
        <tag k="age" v="55" />
    </emp>
    <emp id="3" active="yes">
        <tag k="name" v="scott" />
    </emp>
    <emp id="4" active="no">
        <tag k="address" v="Texas" />
    </emp>
    <emp id="5" gender="male"/>
    <emp id="8" />
    <emp id="9" />
    <emp id="10" />
    <emp id="11" />
</employee>

My objective is to produce a CSV file in which each emp_id from the CSV file is matched against the XML file, so that only the matched emp_id values appear in the new file. I need two CSV files like the ones below:

1st file.

emp_id,active,gender
1,,
2,yes,
3,yes,
4,no,
5,,male

2nd file.

emp_id,key,value
2,age,55
3,name,scott
4,address,Texas

I can read the CSV file with pandas and the XML file with Python, but I don't know how to combine them and extract the keys and values from the XML.

Any help is appreciated.

2
  • What is your current code? Please share it and explain what the problem is. Commented Oct 1, 2020 at 8:46
  • The current code only reads the CSV file and the XML file. I am stuck on the logic for combining the CSV data with the XML file and extracting the information. Commented Oct 1, 2020 at 8:53

3 Answers

2

Another method.

from simplified_scrapy import SimplifiedDoc, utils
empIds = ['1', '2', '3', '4', '5']
# empIds = [id.strip() for id in utils.getFileLines('your csv file path')[1:]]

xml = '''<employee>
    <emp id="1" />
    <emp id="2" active="yes">
        <tag k="age" v="55" />
    </emp>
    <emp id="3" active="yes">
        <tag k="name" v="scott" />
    </emp>
    <emp id="4" active="no">
        <tag k="address" v="Texas" />
    </emp>
    <emp id="5" gender="male"/>
    <emp id="8" />
    <emp id="9" />
    <emp id="10" />
    <emp id="11" />
</employee>
'''
# xml = utils.getFileContent('your xml file path')

rows1 = [['emp_id', 'active', 'gender']]
rows2 = [['emp_id', 'key', 'value']]
doc = SimplifiedDoc(xml)
for id in empIds:
    emp = doc.select('emp#' + id)
    if emp:
        rows1.append([id, emp.get('active'), emp.get('gender')])
        tags = emp.selects('tag')
        if tags:
            for tag in tags:
                rows2.append([id, tag['k'], tag['v']])

utils.save2csv('csv1.csv', rows1, newline='')
utils.save2csv('csv2.csv', rows2, newline='')

Result csv1:

emp_id,active,gender
1,,
2,yes,
3,yes,
4,no,
5,,male

Result csv2:

emp_id,key,value
2,age,55
3,name,scott
4,address,Texas

1

First df: Read XML and create DataFrame

import pandas as pd
import xmltodict

with open('XmlFile.xml') as fd:
    xmlfile = xmltodict.parse(fd.read())
df_xmlfile = pd.DataFrame(xmlfile["employee"]["emp"])
# xmltodict prefixes attribute names with "@"; strip it from the column names
df_xmlfile.columns = [col.replace("@", "") for col in df_xmlfile.columns]

Read CSV as DataFrame

df_csvfile = pd.read_csv("CsvFile.txt")

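Join both dfs on the employee id (a plain join would align rows by index position, not by id)

df_xml_attrs = df_xmlfile[["id", "active", "gender"]].astype({"id": int})
df_first = df_csvfile.merge(df_xml_attrs, left_on="emp_id", right_on="id").drop(columns="id")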

Second df: Get rows containing a tag, unpack those rows to get the keys and values and create second df

df_temp = df_xmlfile[~df_xmlfile["tag"].isna()][["id", "tag"]]
df_second = pd.DataFrame({"emp_id": df_temp["id"],
              "key": [row["@k"] for row in df_temp["tag"]],
              "value": [row["@v"] for row in df_temp["tag"]]})
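The list comprehensions above index row["@k"] directly, which works while every emp has at most one tag child. As far as I know, xmltodict by default returns a dict for a single child and a list of dicts for repeated children, so a small normalising helper may be needed once an emp carries several tags. A minimal sketch, assuming that default shape; the helper name tag_rows and the second tag for id 2 are made up for illustration:

```python
def tag_rows(emp_id, tag):
    """Return [emp_id, key, value] rows for one emp's "tag" entry.

    Assumed shapes (xmltodict defaults): no child -> None or NaN,
    one child -> dict, repeated children -> list of dicts.
    """
    if isinstance(tag, dict):
        tag = [tag]      # single <tag> child
    elif not isinstance(tag, list):
        return []        # no <tag> child at all
    return [[emp_id, t["@k"], t["@v"]] for t in tag]


# Values shaped like xmltodict output; the "city" tag is hypothetical.
parsed = [
    ("1", None),
    ("2", [{"@k": "age", "@v": "55"}, {"@k": "city", "@v": "Austin"}]),
    ("3", {"@k": "name", "@v": "scott"}),
]
rows = [row for emp_id, tag in parsed for row in tag_rows(emp_id, tag)]
print(rows)
```

The resulting rows can be passed to pd.DataFrame(rows, columns=["emp_id", "key", "value"]) to build the second df.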

0

Let's create the first dataframe:

import pandas as pd

# this dataframe should be read from csv in your case
df = pd.DataFrame({
    'emp_id': [1, 2, 3, 4, 5]
})

Then define the XML:

xml = '''<employee>
    <emp id="1" />
    <emp id="2" active="yes">
        <tag k="age" v="55" />
    </emp>
    <emp id="3" active="yes">
        <tag k="name" v="scott" />
    </emp>
    <emp id="4" active="no">
        <tag k="address" v="Texas" />
    </emp>
    <emp id="5" gender="male"/>
    <emp id="8" />
    <emp id="9" />
    <emp id="10" />
    <emp id="11" />
</employee>'''

Read data from XML:

from collections import defaultdict

from lxml import etree as et  # root.xpath() below needs lxml, not xml.etree

d = defaultdict(list)
# here we read XML from string, in your case it should be read from file
root = et.fromstring(xml)
emps = root.xpath('//employee/emp')
# iterate over elements "emp" and extract data
for emp in emps:
  # here we extract attributes id, active and gender (get() returns None if missing)
  d['id'].append(int(emp.get('id')))
  d['active'].append(emp.get('active'))
  d['gender'].append(emp.get('gender'))
  # here we extract age
  age = emp.find('./tag[@k="age"]')
  d['age'].append(None if age is None else age.get('v'))
  # here we extract address
  address = emp.find('./tag[@k="address"]')
  d['address'].append(None if address is None else address.get('v'))

Then create a dataframe from the dict:

df_xml = pd.DataFrame(d)

Then merge the data using the emp_id / id columns:

df_merged = pd.merge(df, df_xml, left_on = 'emp_id', right_on='id', how = 'inner')
del df_merged['id']

df_merged.head(10)


2 Comments

Thanks, but there are multiple keys and values; I have shown just a sample. I don't want the output to have many columns.
In that case, keep only the first line inside the for emp in emps: loop, d['id'].append(int(emp.get('id'))). That will create a dataframe with only one column.
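If the keys vary from one emp to the next, a long layout (one row per emp/key pair) avoids the many-columns problem altogether. Below is a minimal sketch using only the standard library (csv and xml.etree.ElementTree). It assumes the CSV and XML are held in strings; the variable names and the to_csv helper are mine, not from the answers above. With ids 1 to 4 from the question's CSV, emp 5 is not matched and is therefore left out.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Sample inputs as strings; in practice these would be read from the files.
csv_text = "emp_id\n1\n2\n3\n4\n"
xml_text = """<employee>
    <emp id="1" />
    <emp id="2" active="yes"><tag k="age" v="55" /></emp>
    <emp id="3" active="yes"><tag k="name" v="scott" /></emp>
    <emp id="4" active="no"><tag k="address" v="Texas" /></emp>
    <emp id="5" gender="male"/>
    <emp id="8" />
</employee>"""

# Ids to keep, compared as strings because XML attributes are strings.
wanted = {row["emp_id"] for row in csv.DictReader(io.StringIO(csv_text))}

attr_rows = [["emp_id", "active", "gender"]]
kv_rows = [["emp_id", "key", "value"]]
for emp in ET.fromstring(xml_text).iter("emp"):
    emp_id = emp.get("id")
    if emp_id not in wanted:
        continue  # emp is not listed in the CSV
    attr_rows.append([emp_id, emp.get("active", ""), emp.get("gender", "")])
    # One output row per <tag> child, however many keys there are.
    for tag in emp.findall("tag"):
        kv_rows.append([emp_id, tag.get("k"), tag.get("v")])


def to_csv(rows):
    """Serialise a list of rows to CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


print(to_csv(attr_rows))
print(to_csv(kv_rows))
```

To write real files, open each target with open(path, "w", newline="") and pass the file object to csv.writer instead of the StringIO buffer.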
