
I have two tables (CSV files pulled from a database): one with orders, and a second with items that relate back to the orders table. I need to build an XML file from these two files with this kind of structure (simplified for readability):

<ORDERS>
    <ORDER>
        <ORDER_ID>11039515178</ORDER_ID>
        <CUSTOMER_ID>394556458</CUSTOMER_ID>
        <ITEMS>
            <ITEM>
                <PRODUCT_ID>1401817</PRODUCT_ID>
                <AMOUNT>2</AMOUNT>
            </ITEM>
            <ITEM>
                <PRODUCT_ID>1138857</PRODUCT_ID>
                <AMOUNT>10</AMOUNT>
            </ITEM>
            <ITEM>
                <PRODUCT_ID>4707595</PRODUCT_ID>
                <AMOUNT>15</AMOUNT>
            </ITEM>
        </ITEMS>
    </ORDER>
</ORDERS>

I use this code to generate the XML object. It's stripped down to the main structure of the code, so it's easily readable:

import xml.etree.ElementTree as ET
import pandas as pd

order = pd.read_csv("order.csv", encoding='utf8', keep_default_na=False, dtype=str)
order_item = pd.read_csv("order_item.csv", encoding='utf8', keep_default_na=False, dtype=str)

# create XML
xml_order = ET.Element('ORDERS')
for row in order.itertuples():
    item = ET.SubElement(xml_order, 'ORDER')

    o_id = ET.SubElement(item, 'ORDER_ID')
    o_id.text = row.order_id

    customer = ET.SubElement(item, 'CUSTOMER_ID')
    customer.text = row.customer_id

    # filter the items belonging to this order
    order_item_id = order_item[order_item['order_id'] == row.order_id]

    items = ET.SubElement(item, 'ITEMS')
    for order_row in order_item_id.itertuples():
        single_item = ET.SubElement(items, 'ITEM')

        item_id = ET.SubElement(single_item, 'PRODUCT_ID')
        item_id.text = order_row.product_id

        quantity = ET.SubElement(single_item, 'AMOUNT')
        quantity.text = order_row.quantity_ordered

My problem is that it runs unbelievably long: around 15 minutes per 1,000 orders, with each order having about 20 items. I guess I'm doing something wrong here, but I can't find out what. Is there a way to speed it up, or another library I should use? I've already tried itertuples() instead of iterrows(), but that wasn't much help.
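I suspect the filter order_item[order_item['order_id'] == row.order_id] is the bottleneck, since it scans the whole order_item frame once per order. Below is a minimal sketch (untested on my real data) of one way around that: grouping the items a single time up front so each lookup is a dict access.

# Sketch (untested): group order_item once so each order's items are a
# dict lookup instead of a full-frame boolean scan per iteration.
items_by_order = {oid: grp for oid, grp in order_item.groupby('order_id', sort=False)}
empty = order_item.iloc[0:0]  # fallback for orders without items

for row in order.itertuples():
    for order_row in items_by_order.get(row.order_id, empty).itertuples():
        ...  # build the ITEM elements exactly as before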

EDIT:

This is what my data looks like:

import numpy as np

order = pd.DataFrame({"order_id": range(1000000, 1000010, 1),
                      "customer_id": np.random.RandomState(0).randint(1000, 2000, 10)})

order_item = pd.DataFrame({"order_id": np.random.RandomState(0).randint(1000000, 1000010, 100),
                           "product_id": np.random.RandomState(0).randint(1000, 2000, 100),
                           "amount": np.random.RandomState(0).randint(1, 100, 100)})
order_item.sort_values(by="order_id", inplace=True, ignore_index=True)
  • There is surely a way to merge your dataframes on order_id, transform (grouping/exploding or whatever fits your data structure), and then export the relevant columns to XML, all without ET. You should provide excerpts of your csv/df if you need more guidance.
  • OK, I see, that looks good. I'm going to try this one!
  • @Tranbi I read the documentation and tried it out, but I don't think I can create those nested ITEM tags with this approach.
  • Have you tried with a MultiIndex? I'm AFK right now so I can't help you much. Try updating your question with samples of your dfs; it will greatly improve the likelihood of getting a useful answer.
  • For writing HTML or XML, it's usually faster to do it textually rather than building a DOM tree. Good ole print or a templating system like Jinja2 are good options.

4 Answers


When writing XML or HTML, it's frequently faster to write textually rather than paying the cost of building an in-memory XML document. You can write the file directly or use a templating language such as Jinja2. Below is an example using multiline f-strings to write a document with the spacing you want. Since XML doesn't care about newlines or pretty-printing, I'd tend to write without the extra spacing.

The code is a little ugly, but that's true for all templating, IMHO.

import pandas as pd

order = pd.read_csv("order.csv", encoding='utf8', keep_default_na=False, dtype=str)
order_item = pd.read_csv("order_item.csv", encoding='utf8', keep_default_na=False, dtype=str)

with open("out.xml", "w") as outfile:
    outfile.write("<ORDERS>\n")

    for row in order.itertuples():
        outfile.write(f"""\
    <ORDER>
        <ORDER_ID>{row.order_id}</ORDER_ID>
        <CUSTOMER_ID>{row.customer_id}</CUSTOMER_ID>
        <ITEMS>
""")

        order_item_id = order_item[order_item['order_id'] == row.order_id]
        for order_row in order_item_id.itertuples():
            outfile.write(f"""\
            <ITEM>
                <PRODUCT_ID>{order_row.product_id}</PRODUCT_ID>
                <AMOUNT>{order_row.quantity_ordered}</AMOUNT>
            </ITEM>
""")

        outfile.write("""\
        </ITEMS>
    </ORDER>
""")

    outfile.write("</ORDERS>\n")
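One caveat worth noting with textual templating: nothing escapes XML special characters for you. If any field could contain &, < or >, pass it through the standard library's xml.sax.saxutils.escape before interpolating, for example:

from xml.sax.saxutils import escape

# Hand-written XML does no escaping, so run free-text fields through
# escape(), which converts &, < and > into XML entities:
outfile.write(f"<PRODUCT_ID>{escape(order_row.product_id)}</PRODUCT_ID>\n")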

4 Comments

And it will get messier, since there are more tags I need to include per order. But as you said, I don't need the extra spacing. Anyway, thanks for the idea, I'm going to try it out.
Although the speed of all versions is about the same, this approach seems to be the fastest so far. From all the test runs I've made, it's obvious that the main thing slowing it down is the size of the input. Here is a comparison between input size and the time to process the first 1,000 orders: order: 8,000 rows, order_item: 186,000 rows → 6 sec; order: 185,000 rows, order_item: 4,500,000 rows → 150 sec.
To process the larger input (which is what I actually want), it would take 7 hours, which is quite unacceptable. I always considered pandas the fastest library for this kind of task, so I'm not sure I can do any better with another approach... @Tranbi
In this example, pandas may not be the best approach: it pulls the full dataset into memory only to iterate row by row, while the csv module reads a line at a time. Using a solid-state drive, and writing to a different physical disk than the one you read from, can both speed things up when using csv.
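A minimal sketch of that csv-only idea, assuming the column names from the question (order_id, customer_id, product_id, quantity_ordered) and values that need no XML escaping:

import csv
from collections import defaultdict

# Read the items once into a plain dict keyed by order_id; per-order
# lookups are then plain Python, with no DataFrame overhead.
items_by_order = defaultdict(list)
with open("order_item.csv", newline="", encoding="utf8") as f:
    for r in csv.DictReader(f):
        items_by_order[r["order_id"]].append(r)

# Stream the orders file row by row and write the XML as we go.
with open("order.csv", newline="", encoding="utf8") as f, \
     open("out.xml", "w", encoding="utf8") as out:
    out.write("<ORDERS>")
    for r in csv.DictReader(f):
        out.write(f"<ORDER><ORDER_ID>{r['order_id']}</ORDER_ID>"
                  f"<CUSTOMER_ID>{r['customer_id']}</CUSTOMER_ID><ITEMS>")
        for it in items_by_order.get(r["order_id"], []):
            out.write(f"<ITEM><PRODUCT_ID>{it['product_id']}</PRODUCT_ID>"
                      f"<AMOUNT>{it['quantity_ordered']}</AMOUNT></ITEM>")
        out.write("</ITEMS></ORDER>")
    out.write("</ORDERS>")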

I'm not sure what your data looks like, so I hope this works for you; it took me seconds to process ~5000 rows:

import pandas as pd
import lxml.etree as et

df_order = pd.read_csv("order.csv", encoding='utf8', keep_default_na=False, dtype=str)
df_order_item = pd.read_csv("order_item.csv", encoding='utf8', keep_default_na=False, dtype=str)

new_orders = df_order.merge(df_order_item, how='left', on='order_id')

orders = et.Element('ORDERS')
for order_id in new_orders['order_id'].unique():
    rows = new_orders[new_orders['order_id'] == order_id]
    customer_id = rows['customer_id'].iloc[0]  # same for every row of the order
    order = et.SubElement(orders, 'ORDER')
    o_id = et.SubElement(order, 'ORDER_ID')
    o_id.text = order_id
    c_id = et.SubElement(order, 'CUSTOMER_ID')
    c_id.text = customer_id
    items = et.SubElement(order, 'ITEMS')
    for product in rows.itertuples():
        item = et.SubElement(items, 'ITEM')
        p_id = et.SubElement(item, 'PRODUCT_ID')
        p_id.text = product.product_id
        amount = et.SubElement(item, 'AMOUNT')
        amount.text = product.quantity_ordered
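A possible refinement (untested on the asker's data): the unique()-then-filter pattern rescans the merged frame once per order, much like the original code. Iterating the groupby directly yields each order's rows in a single pass:

# Same output, but one pass over new_orders: groupby hands us each
# order's rows directly, with no per-order boolean scan.
for order_id, rows in new_orders.groupby('order_id', sort=False):
    customer_id = rows['customer_id'].iloc[0]
    order = et.SubElement(orders, 'ORDER')
    ...  # same element-building code as above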

1 Comment

Unfortunately, this isn't much faster. The input is quite huge: 185,000 rows in the orders table and 4,500,000 in the order_item table. It slows down significantly with input this large.

Apparently pandas' to_xml doesn't handle this kind of hierarchy. You can write parts of the file directly and use to_xml on each grouped sub-dataframe:

df = order.merge(order_item, on='order_id')

with open('output.xml', 'w') as f:
    f.write('<ORDERS>')

    for (ord_id, cust_id), sub_df in df.groupby(['order_id', 'customer_id']):
        f.write(f'\n<ORDER>\n<ORDER_ID>{ord_id}</ORDER_ID>\n<CUSTOMER_ID>{cust_id}</CUSTOMER_ID>\n')
        f.write(sub_df.to_xml(root_name='ITEMS', row_name='ITEM', xml_declaration=False, elem_cols=['product_id', 'amount'], index=False))
        f.write('\n</ORDER>')

    f.write('\n</ORDERS>')

Let us know if you notice any performance improvement!

Note: you can also choose your XML parser with the parser= keyword argument ('lxml' or 'etree').



I've tried some of the approaches mentioned above, but the thing that significantly sped the whole process up was building those nested <ITEMS> tags already in the database. We use Snowflake, and I did a simple GROUP BY on the order_item table using the LISTAGG aggregate function:

CREATE OR REPLACE TABLE "wrk_order_item" AS
SELECT
    "order_id",
    '<ITEMS>' || LISTAGG('<ITEM><PRODUCT_ID>' || "product_id"  || '</PRODUCT_ID>'
        || '<AMOUNT>' || "quantity_ordered"  || '</AMOUNT>'
        || '<PRICE>' || "sell_price" || '</PRICE></ITEM>') || '</ITEMS>' AS "items"
FROM "ORDER_ITEM"
GROUP BY "order_id";

I then joined it with the order table and removed the creation of the items dataframe in each iteration over the order table in the Python script. Both parts (Snowflake and Python) now finish in seconds.
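For illustration, a sketch of what the simplified Python side might look like. The wrk_order_item.csv export and the items_xml column name are hypothetical (the LISTAGG column is renamed so the prebuilt XML string is unambiguous):

import pandas as pd

order = pd.read_csv("order.csv", encoding='utf8', keep_default_na=False, dtype=str)
# hypothetical export of the wrk_order_item table built above
wrk_order_item = pd.read_csv("wrk_order_item.csv", encoding='utf8',
                             keep_default_na=False, dtype=str)

merged = order.merge(wrk_order_item.rename(columns={"items": "items_xml"}),
                     on="order_id")

with open("out.xml", "w", encoding="utf8") as out:
    out.write("<ORDERS>")
    for row in merged.itertuples():
        # the whole <ITEMS>...</ITEMS> block is already one string per order
        out.write(f"<ORDER><ORDER_ID>{row.order_id}</ORDER_ID>"
                  f"<CUSTOMER_ID>{row.customer_id}</CUSTOMER_ID>"
                  f"{row.items_xml}</ORDER>")
    out.write("</ORDERS>")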

