I have two tables (CSV files pulled from a database): one with orders and a second with items, which reference the orders table. I need to build an XML file from these two files with the following structure (simplified for readability):
<ORDERS>
    <ORDER>
        <ORDER_ID>11039515178</ORDER_ID>
        <CUSTOMER_ID>394556458</CUSTOMER_ID>
        <ITEMS>
            <ITEM>
                <PRODUCT_ID>1401817</PRODUCT_ID>
                <AMOUNT>2</AMOUNT>
            </ITEM>
            <ITEM>
                <PRODUCT_ID>1138857</PRODUCT_ID>
                <AMOUNT>10</AMOUNT>
            </ITEM>
            <ITEM>
                <PRODUCT_ID>4707595</PRODUCT_ID>
                <AMOUNT>15</AMOUNT>
            </ITEM>
        </ITEMS>
    </ORDER>
</ORDERS>
I use this code to generate the XML object. It's stripped down to the main structure, so it's easy to read:
import xml.etree.ElementTree as ET
import pandas as pd

# Read everything as strings so the values can be assigned to .text directly
order = pd.read_csv("order.csv", encoding='utf8', keep_default_na=False, dtype=str)
order_item = pd.read_csv("order_item.csv", encoding='utf8', keep_default_na=False, dtype=str)

# create XML
xml_order = ET.Element('ORDERS')
for row in order.itertuples():
    item = ET.SubElement(xml_order, 'ORDER')
    o_id = ET.Element('ORDER_ID')
    o_id.text = row.order_id
    item.append(o_id)
    customer = ET.Element('CUSTOMER_ID')
    customer.text = row.customer_id
    item.append(customer)
    # Select the items of the current order (scans the whole item table on every iteration)
    order_item_id = order_item[order_item['order_id'] == row.order_id]
    items = ET.SubElement(item, 'ITEMS')
    for order_row in order_item_id.itertuples():
        single_item = ET.SubElement(items, 'ITEM')
        item_id = ET.Element('PRODUCT_ID')
        item_id.text = order_row.product_id
        single_item.append(item_id)
        quantity = ET.Element('AMOUNT')
        quantity.text = order_row.quantity_ordered
        single_item.append(quantity)
My problem is that it runs extremely slowly: around 15 minutes per 1000 orders, with each order having about 20 items. I guess I'm doing something wrong here, but I can't figure out what. Is there a way to speed it up? Should I use another library? I've tried using itertuples() instead of iterrows(), but that didn't help much.
EDIT:
This is what my data looks like:
import numpy as np
import pandas as pd

order = pd.DataFrame({"order_id": range(1000000, 1000010, 1),
                      "customer_id": np.random.RandomState(0).randint(1000, 2000, 10)})
order_item = pd.DataFrame({"order_id": np.random.RandomState(0).randint(1000000, 1000010, 100),
                           "product_id": np.random.RandomState(0).randint(1000, 2000, 100),
                           "amount": np.random.RandomState(0).randint(1, 100, 100)})
order_item.sort_values(by="order_id", inplace=True, ignore_index=True)
order_id, transform (grouping/exploding or whatever fits your data structure) and then export the relevant columns to xml. All without ET. You should provide excerpts of your csv/df if you need more guidance.
ITEM tags with this approach
print or a templating system like jinja2 are good options.
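One possible way to act on the grouping idea is to group the item table by order_id once, instead of filtering it again for every order. The following is only a minimal sketch under assumptions: it uses the column names from the EDIT sample (amount rather than quantity_ordered), the helper name build_orders_xml is made up for illustration, and it still uses ElementTree to build the tree. Whether the repeated filtering is really the main cost for the real data has not been measured here.

```python
import xml.etree.ElementTree as ET

import numpy as np
import pandas as pd


def build_orders_xml(order, order_item):
    """Hypothetical helper: build the ORDERS tree with one pass over each table."""
    # Group the items once: order_id -> list of (product_id, amount) pairs.
    items_by_order = {
        order_id: list(zip(group["product_id"], group["amount"]))
        for order_id, group in order_item.groupby("order_id", sort=False)
    }
    root = ET.Element("ORDERS")
    for order_id, customer_id in zip(order["order_id"], order["customer_id"]):
        order_el = ET.SubElement(root, "ORDER")
        ET.SubElement(order_el, "ORDER_ID").text = str(order_id)
        ET.SubElement(order_el, "CUSTOMER_ID").text = str(customer_id)
        items_el = ET.SubElement(order_el, "ITEMS")
        # Dictionary lookup replaces the per-order boolean filter.
        for product_id, amount in items_by_order.get(order_id, []):
            item_el = ET.SubElement(items_el, "ITEM")
            ET.SubElement(item_el, "PRODUCT_ID").text = str(product_id)
            ET.SubElement(item_el, "AMOUNT").text = str(amount)
    return root


# Sample data in the same shape as the EDIT above (assumed column names).
rng = np.random.RandomState(0)
order = pd.DataFrame({"order_id": range(1000000, 1000010),
                      "customer_id": rng.randint(1000, 2000, 10)})
order_item = pd.DataFrame({"order_id": rng.randint(1000000, 1000010, 100),
                           "product_id": rng.randint(1000, 2000, 100),
                           "amount": rng.randint(1, 100, 100)})

root = build_orders_xml(order, order_item)
xml_bytes = ET.tostring(root, encoding="utf-8")
```

With CSV data read via dtype=str, the str() calls are redundant but harmless. The result could be written to disk with ET.ElementTree(root).write(...) if a file is needed.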