Create a single XML file from a pandas data frame

Question

I want to create a specifically formatted XML file from a given pandas data-frame. My data-frame looks something like this -

Doc_ID      Doc_Name       Doc_Category

abc123      aaa111            c1
abc456      aaa222            c2

And I want to format such a dataset having 10k rows into a single XML file having the format -

<DOC>
<DOCNO> abc123 </DOCNO>
<TEXT> aaa111 + c1 </TEXT>  ### Combines strings from 2 columns
</DOC>

<DOC>
<DOCNO> abc456 </DOCNO>
<TEXT> aaa222 + c2 </TEXT>  ### Combines strings from 2 columns
</DOC>

I was trying to use something similar to this, but I unable to combine them all into a single XML file.

for i,row in testdoc.iterrows():
    xml =['<DOC>']
    xml.append('<{0}>{1}</{0}>'.format("DocNO", row["Doc_ID"]))
    xml.append('<{0}>{1}</{0}>'.format("Text", row["Doc_Name"]+row['Doc_Category']))
    xml.append('</DOC>')

How can I go about doing this? It would be nice to have an invalid character handler too.

Thanks!

balderman · Accepted Answer · 2020-10-07 12:39:27Z

2

Try

import pandas as pd

df = pd.DataFrame(
    [{'doc_id': 1, 'doc_name': 'jack', 'doc_cat': '__abc__'}, {'doc_id': 11, 'doc_name': 'ben', 'doc_cat': '$$abc$$'}])
d = df.to_dict(orient='list')
xml = '<ROOT>'
for idx, x in enumerate(d['doc_id']):
    xml += '<DOC>'
    xml += f'<DOCNO>{x}</DOCNO>'
    xml += f'<TEXT>{d["doc_cat"][idx]}{d["doc_name"][idx]}</TEXT>'
    xml += '</DOC>'
xml += '</ROOT>'
print(xml)

output

<ROOT>
    <DOC>
        <DOCNO>1</DOCNO>
        <TEXT>__abc__jack</TEXT>
    </DOC>
    <DOC>
        <DOCNO>11</DOCNO>
        <TEXT>$$abc$$ben</TEXT>
    </DOC>
</ROOT>

answered Oct 7, 2020 at 12:39

balderman

24k8 gold badges39 silver badges60 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Alexandra Dudkina · Accepted Answer · 2020-10-07 12:27:56Z

1

import pandas as pd

df = pd.DataFrame({
    'Doc_ID': ['abc123', 'abc456'],
    'Doc_Name': ['aaa111', 'aaa222'],
    'Doc_Category': ['c1', 'c2']
})

def func(row):
    xml = ['  <DOC>']
    xml.append('    <DOCNO>{0}</DOCNO>'.format(row['Doc_ID']))
    xml.append('    <TEXT>{0}{1}</TEXT>'.format(row['Doc_Name'], row['Doc_Category']))
    xml.append('  </DOC>\n')
    return '\n'.join(xml)

xml = '<DOCS>\n' + ''.join(df.apply(func, axis = 1)) + '</DOCS>'
print(xml)

Output:

<DOCS>
  <DOC>
    <DOCNO>abc123</DOCNO>
    <TEXT>aaa111c1</TEXT>
  </DOC>
  <DOC>
    <DOCNO>abc456</DOCNO>
    <TEXT>aaa222c2</TEXT>
  </DOC>
</DOCS>

answered Oct 7, 2020 at 12:27

Alexandra Dudkina

4,5123 gold badges18 silver badges29 bronze badges

5 Comments

Soumya Ranjan Sahoo Over a year ago

Thank you very much.

Soumya Ranjan Sahoo Over a year ago

Also, I keep on getting the following error while running my script with the above function:UnicodeEncodeError: 'charmap' codec can't encode character '\u2265' in position 6716874: character maps to <undefined>. Is there a way to simply ignore such characters?

Alexandra Dudkina Over a year ago

Occurs error when calling the function or when writing output to file?

Soumya Ranjan Sahoo Over a year ago

While writing the output to the file.

Alexandra Dudkina Over a year ago

Try to specify encoding when opening file: with open('filename', 'w', encoding='utf-8') as f:

Collectives™ on Stack Overflow

Create a single XML file from a pandas data frame

2 Answers 2

Comments

5 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

5 Comments

Your Answer

Sign up or log in

Post as a guest

Related