
I have two tables (CSV files pulled from a database): one with orders, and a second with items that relate back to the orders table. I need to build an XML file from these two files with this kind of structure (simplified for readability):

<ORDERS>
    <ORDER>
        <ORDER_ID>11039515178</ORDER_ID>
        <CUSTOMER_ID>394556458</CUSTOMER_ID>
        <ITEMS>
            <ITEM>
                <PRODUCT_ID>1401817</PRODUCT_ID>
                <AMOUNT>2</AMOUNT>
            </ITEM>
            <ITEM>
                <PRODUCT_ID>1138857</PRODUCT_ID>
                <AMOUNT>10</AMOUNT>
            </ITEM>
            <ITEM>
                <PRODUCT_ID>4707595</PRODUCT_ID>
                <AMOUNT>15</AMOUNT>
            </ITEM>
        </ITEMS>
    </ORDER>
</ORDERS>

I use this code to generate the XML object. It's stripped down to the main structure of the code, so it's easily readable:

import xml.etree.ElementTree as ET
import pandas as pd

order = pd.read_csv("order.csv", encoding='utf8', keep_default_na=False, dtype=str)
order_item = pd.read_csv("order_item.csv", encoding='utf8', keep_default_na=False, dtype=str)

# create XML
xml_order = ET.Element('ORDERS')
for row in order.itertuples():
    item = ET.SubElement(xml_order, 'ORDER')

    o_id = ET.SubElement(item, 'ORDER_ID')
    o_id.text = row.order_id

    customer = ET.SubElement(item, 'CUSTOMER_ID')
    customer.text = row.customer_id

    # filter the items belonging to this order
    order_item_id = order_item[order_item['order_id'] == row.order_id]

    items = ET.SubElement(item, 'ITEMS')
    for order_row in order_item_id.itertuples():
        single_item = ET.SubElement(items, 'ITEM')

        item_id = ET.SubElement(single_item, 'PRODUCT_ID')
        item_id.text = order_row.product_id

        quantity = ET.SubElement(single_item, 'AMOUNT')
        quantity.text = order_row.quantity_ordered

My problem is that it runs unbelievably long: around 15 minutes per 1,000 orders, with each order having about 20 items. I guess I'm doing something wrong here, but I can't find out what. Is there a way to speed it up, or another library I should use? I've already tried itertuples() instead of iterrows(), but that wasn't much help.
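I suspect the filter order_item[order_item['order_id'] == row.order_id] is the bottleneck, since it scans the whole order_item frame once per order. Below is a minimal sketch (untested on my real data) of one way around that: grouping the items a single time up front so each lookup is a dict access.

# Sketch (untested): group order_item once so each order's items are a
# dict lookup instead of a full-frame boolean scan per iteration.
items_by_order = {oid: grp for oid, grp in order_item.groupby('order_id', sort=False)}
empty = order_item.iloc[0:0]  # fallback for orders without items

for row in order.itertuples():
    for order_row in items_by_order.get(row.order_id, empty).itertuples():
        ...  # build the ITEM elements exactly as before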

EDIT:

This is what my data looks like:

import numpy as np

order = pd.DataFrame({"order_id": range(1000000, 1000010, 1),
                      "customer_id": np.random.RandomState(0).randint(1000, 2000, 10)})

order_item = pd.DataFrame({"order_id": np.random.RandomState(0).randint(1000000, 1000010, 100),
                           "product_id": np.random.RandomState(0).randint(1000, 2000, 100),
                           "amount": np.random.RandomState(0).randint(1, 100, 100)})
order_item.sort_values(by="order_id", inplace=True, ignore_index=True)
  • There is surely a way to merge your dataframes on order_id, transform (grouping/exploding or whatever fits your data structure), and then export the relevant columns to XML, all without ET. You should provide excerpts of your csv/df if you need more guidance.
  • OK, I see, that looks good. I'm going to try this one!
  • @Tranbi I read the documentation and tried it out, but I don't think I can create those nested ITEM tags with this approach.
  • Have you tried with a MultiIndex? I'm AFK right now so I can't help you much. Try updating your question with samples of your dfs; it will greatly improve the likelihood of getting a useful answer.
  • For writing HTML or XML, it's usually faster to do it textually rather than building a DOM tree. Good ole print or a templating system like Jinja2 are good options.

4 Answers


When writing XML or HTML, it's frequently faster to write textually rather than paying the cost of building an in-memory XML document. You can write the file directly or use a templating language such as Jinja2. Below is an example using multiline f-strings to write a document with the spacing you want. Since XML doesn't care about newlines or pretty-printing, I'd tend to write without the extra spacing.

The code is a little ugly, but that's true for all templating, IMHO.

import pandas as pd

order = pd.read_csv("order.csv", encoding='utf8', keep_default_na=False, dtype=str)
order_item = pd.read_csv("order_item.csv", encoding='utf8', keep_default_na=False, dtype=str)

with open("out.xml", "w") as outfile:
    outfile.write("<ORDERS>\n")

    for row in order.itertuples():
        outfile.write(f"""\
    <ORDER>
        <ORDER_ID>{row.order_id}</ORDER_ID>
        <CUSTOMER_ID>{row.customer_id}</CUSTOMER_ID>
        <ITEMS>
""")

        order_item_id = order_item[order_item['order_id'] == row.order_id]
        for order_row in order_item_id.itertuples():
            outfile.write(f"""\
            <ITEM>
                <PRODUCT_ID>{order_row.product_id}</PRODUCT_ID>
                <AMOUNT>{order_row.quantity_ordered}</AMOUNT>
            </ITEM>
""")

        outfile.write("""\
        </ITEMS>
    </ORDER>
""")

    outfile.write("</ORDERS>\n")
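One caveat worth noting with textual templating: nothing escapes XML special characters for you. If any field could contain &, < or >, pass it through the standard library's xml.sax.saxutils.escape before interpolating, for example:

from xml.sax.saxutils import escape

# Hand-written XML does no escaping, so run free-text fields through
# escape(), which converts &, < and > into XML entities:
outfile.write(f"<PRODUCT_ID>{escape(order_row.product_id)}</PRODUCT_ID>\n")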

4 Comments

And it will get messier, since there are more tags I need to include per order. But as you said, I don't need the extra spacing. Anyway, thanks for the idea, I'm going to try it out.
Although the speed of all versions is about the same, this approach seems to be the fastest so far. From all the test runs I've made, it's obvious that the main thing slowing it down is the size of the input. Here is a comparison between input size and the time to process the first 1,000 orders: order: 8,000 rows, order_item: 186,000 rows → 6 sec; order: 185,000 rows, order_item: 4,500,000 rows → 150 sec.
To process the larger input (which is what I actually want), it would take 7 hours, which is quite unacceptable. I always considered pandas the fastest library for this kind of task, so I'm not sure I can do any better with another approach... @Tranbi
In this example, pandas may not be the best approach: it pulls the full dataset into memory only to iterate row by row, while the csv module reads a line at a time. Using a solid-state drive, and writing to a different physical disk than the one you read from, can both speed things up when using csv.
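A minimal sketch of that csv-only idea, assuming the column names from the question (order_id, customer_id, product_id, quantity_ordered) and values that need no XML escaping:

import csv
from collections import defaultdict

# Read the items once into a plain dict keyed by order_id; per-order
# lookups are then plain Python, with no DataFrame overhead.
items_by_order = defaultdict(list)
with open("order_item.csv", newline="", encoding="utf8") as f:
    for r in csv.DictReader(f):
        items_by_order[r["order_id"]].append(r)

# Stream the orders file row by row and write the XML as we go.
with open("order.csv", newline="", encoding="utf8") as f, \
     open("out.xml", "w", encoding="utf8") as out:
    out.write("<ORDERS>")
    for r in csv.DictReader(f):
        out.write(f"<ORDER><ORDER_ID>{r['order_id']}</ORDER_ID>"
                  f"<CUSTOMER_ID>{r['customer_id']}</CUSTOMER_ID><ITEMS>")
        for it in items_by_order.get(r["order_id"], []):
            out.write(f"<ITEM><PRODUCT_ID>{it['product_id']}</PRODUCT_ID>"
                      f"<AMOUNT>{it['quantity_ordered']}</AMOUNT></ITEM>")
        out.write("</ITEMS></ORDER>")
    out.write("</ORDERS>")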

I'm not sure what your data looks like, so I hope this works for you; it took me seconds to process ~5000 rows:

import pandas as pd
import lxml.etree as et

df_order = pd.read_csv("order.csv", encoding='utf8', keep_default_na=False, dtype=str)
df_order_item = pd.read_csv("order_item.csv", encoding='utf8', keep_default_na=False, dtype=str)

new_orders = df_order.merge(df_order_item, how='left', on='order_id')

orders = et.Element('ORDERS')
for order_id in new_orders['order_id'].unique():
    rows = new_orders[new_orders['order_id'] == order_id]
    customer_id = rows['customer_id'].iloc[0]  # same for every row of the order
    order = et.SubElement(orders, 'ORDER')
    o_id = et.SubElement(order, 'ORDER_ID')
    o_id.text = order_id
    c_id = et.SubElement(order, 'CUSTOMER_ID')
    c_id.text = customer_id
    items = et.SubElement(order, 'ITEMS')
    for product in rows.itertuples():
        item = et.SubElement(items, 'ITEM')
        p_id = et.SubElement(item, 'PRODUCT_ID')
        p_id.text = product.product_id
        amount = et.SubElement(item, 'AMOUNT')
        amount.text = product.quantity_ordered
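A possible refinement (untested on the asker's data): the unique()-then-filter pattern rescans the merged frame once per order, much like the original code. Iterating the groupby directly yields each order's rows in a single pass:

# Same output, but one pass over new_orders: groupby hands us each
# order's rows directly, with no per-order boolean scan.
for order_id, rows in new_orders.groupby('order_id', sort=False):
    customer_id = rows['customer_id'].iloc[0]
    order = et.SubElement(orders, 'ORDER')
    ...  # same element-building code as above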

1 Comment

Unfortunately, this isn't much faster. The input is quite huge: 185,000 rows in the orders table and 4,500,000 in the order_item table. It slows down significantly with input this large.

Apparently pandas' to_xml doesn't handle this kind of hierarchy. You can write parts of the file directly and use to_xml on each grouped sub-dataframe:

df = order.merge(order_item, on='order_id')

with open('output.xml', 'w') as f:
    f.write('<ORDERS>')

    for (ord_id, cust_id), sub_df in df.groupby(['order_id', 'customer_id']):
        f.write(f'\n<ORDER>\n<ORDER_ID>{ord_id}</ORDER_ID>\n<CUSTOMER_ID>{cust_id}</CUSTOMER_ID>\n')
        f.write(sub_df.to_xml(root_name='ITEMS', row_name='ITEM', xml_declaration=False, elem_cols=['product_id', 'amount'], index=False))
        f.write('\n</ORDER>')

    f.write('\n</ORDERS>')

Let us know if you notice any performance improvement!

Note: you can also choose your XML parser with the parser= keyword argument ('lxml' or 'etree').



I've tried some of the approaches mentioned above, but the thing that significantly sped the whole process up was building those nested <ITEMS> tags already in the database. We use Snowflake, and I did a simple GROUP BY on the order_item table using the LISTAGG aggregate function:

CREATE OR REPLACE TABLE "wrk_order_item" AS
SELECT
    "order_id",
    '<ITEMS>' || LISTAGG('<ITEM><PRODUCT_ID>' || "product_id"  || '</PRODUCT_ID>'
        || '<AMOUNT>' || "quantity_ordered"  || '</AMOUNT>'
        || '<PRICE>' || "sell_price" || '</PRICE></ITEM>') || '</ITEMS>' AS "items"
FROM "ORDER_ITEM"
GROUP BY "order_id";

I then joined it with the order table and removed the creation of the items dataframe in each iteration over the order table in the Python script. Both parts (Snowflake and Python) now finish in seconds.
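For illustration, a sketch of what the simplified Python side might look like. The wrk_order_item.csv export and the items_xml column name are hypothetical (the LISTAGG column is renamed so the prebuilt XML string is unambiguous):

import pandas as pd

order = pd.read_csv("order.csv", encoding='utf8', keep_default_na=False, dtype=str)
# hypothetical export of the wrk_order_item table built above
wrk_order_item = pd.read_csv("wrk_order_item.csv", encoding='utf8',
                             keep_default_na=False, dtype=str)

merged = order.merge(wrk_order_item.rename(columns={"items": "items_xml"}),
                     on="order_id")

with open("out.xml", "w", encoding="utf8") as out:
    out.write("<ORDERS>")
    for row in merged.itertuples():
        # the whole <ITEMS>...</ITEMS> block is already one string per order
        out.write(f"<ORDER><ORDER_ID>{row.order_id}</ORDER_ID>"
                  f"<CUSTOMER_ID>{row.customer_id}</CUSTOMER_ID>"
                  f"{row.items_xml}</ORDER>")
    out.write("</ORDERS>")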

