2

I want to create a specifically formatted XML file from a given pandas data-frame. My data-frame looks something like this -

Doc_ID      Doc_Name       Doc_Category

abc123      aaa111            c1
abc456      aaa222            c2

And I want to format such a dataset having 10k rows into a single XML file having the format -

<DOC>
<DOCNO> abc123 </DOCNO>
<TEXT> aaa111 + c1 </TEXT>  ### Combines strings from 2 columns
</DOC>

<DOC>
<DOCNO> abc456 </DOCNO>
<TEXT> aaa222 + c2 </TEXT>  ### Combines strings from 2 columns
</DOC>

I was trying to use something similar to this, but I unable to combine them all into a single XML file.

for i,row in testdoc.iterrows():
    xml =['<DOC>']
    xml.append('<{0}>{1}</{0}>'.format("DocNO", row["Doc_ID"]))
    xml.append('<{0}>{1}</{0}>'.format("Text", row["Doc_Name"]+row['Doc_Category']))
    xml.append('</DOC>')

How can I go about doing this? It would be nice to have an invalid character handler too.

Thanks!

2 Answers 2

2

Try

import pandas as pd

df = pd.DataFrame(
    [{'doc_id': 1, 'doc_name': 'jack', 'doc_cat': '__abc__'}, {'doc_id': 11, 'doc_name': 'ben', 'doc_cat': '$$abc$$'}])
d = df.to_dict(orient='list')
xml = '<ROOT>'
for idx, x in enumerate(d['doc_id']):
    xml += '<DOC>'
    xml += f'<DOCNO>{x}</DOCNO>'
    xml += f'<TEXT>{d["doc_cat"][idx]}{d["doc_name"][idx]}</TEXT>'
    xml += '</DOC>'
xml += '</ROOT>'
print(xml)

output

<ROOT>
    <DOC>
        <DOCNO>1</DOCNO>
        <TEXT>__abc__jack</TEXT>
    </DOC>
    <DOC>
        <DOCNO>11</DOCNO>
        <TEXT>$$abc$$ben</TEXT>
    </DOC>
</ROOT>
Sign up to request clarification or add additional context in comments.

Comments

1
import pandas as pd

df = pd.DataFrame({
    'Doc_ID': ['abc123', 'abc456'],
    'Doc_Name': ['aaa111', 'aaa222'],
    'Doc_Category': ['c1', 'c2']
})

def func(row):
    xml = ['  <DOC>']
    xml.append('    <DOCNO>{0}</DOCNO>'.format(row['Doc_ID']))
    xml.append('    <TEXT>{0}{1}</TEXT>'.format(row['Doc_Name'], row['Doc_Category']))
    xml.append('  </DOC>\n')
    return '\n'.join(xml)

xml = '<DOCS>\n' + ''.join(df.apply(func, axis = 1)) + '</DOCS>'
print(xml)

Output:

<DOCS>
  <DOC>
    <DOCNO>abc123</DOCNO>
    <TEXT>aaa111c1</TEXT>
  </DOC>
  <DOC>
    <DOCNO>abc456</DOCNO>
    <TEXT>aaa222c2</TEXT>
  </DOC>
</DOCS>

5 Comments

Thank you very much.
Also, I keep on getting the following error while running my script with the above function:UnicodeEncodeError: 'charmap' codec can't encode character '\u2265' in position 6716874: character maps to <undefined>. Is there a way to simply ignore such characters?
Occurs error when calling the function or when writing output to file?
While writing the output to the file.
Try to specify encoding when opening file: with open('filename', 'w', encoding='utf-8') as f:

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.