I have the following dataframe (already processed and cleaned to remove special characters, etc.).
| parent_id | members_id | item_id | item_name |
|---|---|---|---|
| par_100 | member1 | item1 | t shirt |
| par_100 | member1 | item2 | denims |
| par_102 | member2 | item3 | shirt |
| par_103 | member3 | item4 | shorts |
| par_103 | member3 | item5 | blouse |
| par_103 | member4 | item6 | sweater |
| par_103 | member4 | item7 | hoodie |
and the following class structure:
```python
class Member:
    def __init__(self, id):
        self.member_id = id
        self.items = []


class Item:
    def __init__(self, id, name):
        self.item_id = id
        self.name = name
```
The dataframe has 500K+ rows. I want to create a dictionary (or another suitable structure) where "parent_id" is the primary key and the remaining columns are mapped to the class objects above. After building this structure, I will perform some actions based on business logic that require looping through all the members.
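To make the target concrete, here is a hand-built example of the shape I'm after for the first parent (illustrative only; the keys and values are taken from the table above):

```python
# Target structure: parent_id -> list of Member objects,
# where each Member holds its own list of Item objects.
m1 = Member("member1")
m1.items.append(Item("item1", "t shirt"))
m1.items.append(Item("item2", "denims"))

parent_dict = {
    "par_100": [m1],
    # "par_102": [...], "par_103": [...], and so on
}
```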
The first step is to build this structure from the dataframe. I have the following code, which does the job, but it takes around 3 hours to process all 500K+ rows:
```python
from collections import defaultdict

parent_dict = defaultdict(list)

# sorted_data is the dataframe shown above
parent_key_list = sorted_data['parent_id'].unique().tolist()
for parent_key in parent_key_list:
    # all rows belonging to this parent
    temp_data = sorted_data.loc[sorted_data['parent_id'] == parent_key]
    unique_members = temp_data["members_id"].unique()
    for us in unique_members:
        # all rows belonging to this member within the parent
        items = temp_data.loc[temp_data['members_id'] == us]
        temp_member = Member(us)
        for _, row in items.iterrows():
            temp_member.items.append(Item(row["item_id"], row["item_name"]))
        parent_dict[parent_key].append(temp_member)
```
Since boolean-mask `.loc` filtering is a very expensive operation (each call scans the whole frame), I tried the same thing with numpy arrays, but the performance was even worse. Is there a better approach to reduce the processing time?
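For reference, I've been wondering whether a single pass with groupby over both keys would avoid the repeated full-frame scans; this is a rough, untested sketch of what I mean (assuming the Member and Item classes above):

```python
from collections import defaultdict

parent_dict = defaultdict(list)

# Group once by (parent_id, members_id) instead of re-filtering
# the whole frame for every parent and member.
for (parent_key, member_key), group in sorted_data.groupby(
        ["parent_id", "members_id"], sort=False):
    member = Member(member_key)
    member.items = [Item(i, n) for i, n in
                    zip(group["item_id"], group["item_name"])]
    parent_dict[parent_key].append(member)
```

This should touch each row only once, but I haven't verified it at scale.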