3

I have a dataset where each row represents a unique sale. One column represents the items ordered in that sale, as CSV.

I need to separate the values in the "items ordered" column and for each item ordered, create a unique row, tied to the original order ID.

For example:

Order ID   Items Ordered
   127    Item 1, Item 2, Item 3

Should be:

Order ID   Items Ordered
   127         Item 1
   127         Item 2
   127         Item 3

I am relatively new to python/pandas. This code works, but is slow as I'm iterating over all the orders. What's a better way to do this?

temp_item_list = []

for row in raw_data.iterrows():
    for i in range(len(row['Items'])):
        temp_item_list.append((row['ID'], row['Items'][i]))

item_df = pd.DataFrame(temp_item_list)
0

4 Answers 4

4

To avoid iterating through all rows, you can use numpy.repeat to make columns Order ID while flattening the Items Ordered and make new data frame from this:

from itertools import chain
import numpy as np
import pandas as pd

# count the number of items in the items order column
items_count = df["Items Ordered"].str.count(",") + 1

# extend the Order ID column with numpy repeat and flatten Items Order column
pd.DataFrame({"Order ID": np.repeat(df["Order ID"], items_count),
              "Items Ordered": list(chain.from_iterable(df["Items Ordered"].str.split(",")))})

enter image description here

Sign up to request clarification or add additional context in comments.

Comments

3

With reference to similar answer you may try this step-by-step guide:

>>> initial_df = pandas.read_csv("your_file_path")
>>> final_df = pandas.concat([Series(row['Order ID'], row['Items Ordered'].split(',')) for _, row in initial_df.iterrows()]).reset_index()
>>> final_df
     index    0
0   Item 1  127
1   Item 2  127
2   Item 3  127
>>> final_df.columns= ['Items Ordered','Order ID']
>>> final_df
  Items Ordered  Order ID
0        Item 1       127
1        Item 2       127
2        Item 3       127
>>> final_df[['Order ID','Items Ordered']]
   Order ID Items Ordered
0       127        Item 1
1       127        Item 2
2       127        Item 3

This will get your job done.

Comments

0

Something that might help you is pandas.DataFrame.apply. It allows you to apply a function to every row. You can define a function that grabs the items element in the row, splits it up, and calls the pandas.DataFrame.append to create a new row for each item.

Here is an example on how to use the apply function too.

Comments

0

may this is what you are looking for;

df = pd.DataFrame({'order_id':127, 'items_ordered': ['item_1, item_2, item_3']})
df1 = pd.concat([df.order_id,df.items_ordered.str.split(',').apply(pd.Series)],axis=1,ignore_index=True)
df1 = df1.pivot(index = df1.index, columns = df1.columns[0]).stack().stack().to_frame()
df1 = df1.set_index(df1.index.get_level_values(level=1))

or alternatively a more succinct solution combining pivot and set_index steps above is:

df1 = pd.melt(df, id_vars = ['order_id'])

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.