1

Given the following dataframe

import pandas as pd

data = [[1, 'Yes', 'A', 'No', 'Yes', 'No', 'No'],
        [2, 'Yes', 'A', 'No', 'No', 'Yes', 'No'],
        [3, 'Yes', 'B', 'No', 'No', 'Yes', 'No'],
        [4, 'No', '', '', '', '', ''],
        [5, 'No', '', '', '', '', ''],
        [6, 'Yes', 'C', 'No', 'No', 'Yes', 'Yes'],
        [7, 'Yes', 'A', 'No', 'Yes', 'No', 'No'],
        [8, 'Yes', 'A', 'No', 'No', 'Yes', 'No'],
        [9, 'No', '', '', '', '', ''],
        [10, 'Yes', 'B', 'Yes', 'Yes', 'No', 'No']]
df = pd.DataFrame(data, columns=['Cust_ID', 'OrderMade', 'OrderType', 'OrderCategoryA', 'OrderCategoryB', 'OrderCategoryC', 'OrderCategoryD'])


+----+-----------+-------------+-------------+------------------+------------------+------------------+------------------+
|    |   Cust_ID | OrderMade   | OrderType   | OrderCategoryA   | OrderCategoryB   | OrderCategoryC   | OrderCategoryD   |
|----+-----------+-------------+-------------+------------------+------------------+------------------+------------------|
|  0 |         1 | Yes         | A           | No               | Yes              | No               | No               |
|  1 |         2 | Yes         | A           | No               | No               | Yes              | No               |
|  2 |         3 | Yes         | B           | No               | No               | Yes              | No               |
|  3 |         4 | No          |             |                  |                  |                  |                  |
|  4 |         5 | No          |             |                  |                  |                  |                  |
|  5 |         6 | Yes         | C           | No               | No               | Yes              | Yes              |
|  6 |         7 | Yes         | A           | No               | Yes              | No               | No               |
|  7 |         8 | Yes         | A           | No               | No               | Yes              | No               |
|  8 |         9 | No          |             |                  |                  |                  |                  |
|  9 |        10 | Yes         | B           | Yes              | Yes              | No               | No               |
+----+-----------+-------------+-------------+------------------+------------------+------------------+------------------+

How can I transform this to make rows based on the OrderCategory?

+--------+-----------+----------+----------------+
|Cust_ID | OrderMade |OrderType | OrderCategory  |
|--------+-----------+----------+----------------|
|1       |   Yes     |    A     | OrderCategoryB |
|2       |   Yes     |    A     | OrderCategoryC |
|3       |   Yes     |    B     | OrderCategoryC |
|4       |   No      |          |                |
|5       |   No      |          |                |
|6       |   Yes     |    C     | OrderCategoryC |
|6       |   Yes     |    C     | OrderCategoryD |
|7       |   Yes     |    A     | OrderCategoryB |
|8       |   Yes     |    A     | OrderCategoryC |
|9       |   No      |          |                |
|10      |   Yes     |    B     | OrderCategoryA |
|10      |   Yes     |    B     | OrderCategoryB |
+--------+-----------+----------+----------------+

I tried to use crosstab to start with one OrderCategory, and planned to duplicate for each category, but this seems inefficient and I wasn't sure how to proceed to get my desired result...

imgCROSS = pd.crosstab(df["Cust_ID"], df["OrderCategoryA"])

Returns...

OrderCategoryA     No  Yes
Cust_ID                   
1               0   1    0
2               0   1    0
3               0   1    0
4               1   0    0
5               1   0    0
6               0   1    0
7               0   1    0
8               0   1    0
9               1   0    0
10              0   0    1

I also thought I could populate a new empty column called Category and iterate over each row, populating the appropriate category based on the Yes/No value, but this wouldn't work for rows which have multiple categories. Also, the below implementation of this idea returned an empty column.

df["Category"] = ""
for index, row in df.iterrows():
    catA = row["OrderCategoryA"]
    catB = row["OrderCategoryB"]
    catC = row["OrderCategoryC"]
    catD = row["OrderCategoryD"]

    if catA == "Yes":
        row["Category"] = "OrderCategoryA"
    elif catB == "Yes":
        row["Category"] = "OrderCategoryB"
    elif catC == "Yes":
        row["Category"] = "OrderCategoryC"
    elif catD == "Yes":
        row["Category"] = "OrderCategoryD"

I know I need to transform the dataframe, probably multiple times before I can get my desired result. Just stuck on how to proceed.

4 Answers

3

Let's use pandas in four steps:

df_1 = df.set_index(['Cust_ID', 'OrderMade', 'OrderType'])

df_2 = df_1.where((df_1 == "Yes") | (df_1 == "")).rename_axis('OrderCategory', axis=1).stack().reset_index()

df_2['OrderCategory'] = df_2['OrderCategory'].mask(df_2['OrderMade'] == 'No','')

df_2.drop_duplicates().drop(0, axis=1)

Output:

    Cust_ID OrderMade OrderType   OrderCategory
0         1       Yes         A  OrderCategoryB
1         2       Yes         A  OrderCategoryC
2         3       Yes         B  OrderCategoryC
3         4        No                          
8         5        No                          
13        6       Yes         C  OrderCategoryC
14        6       Yes         C  OrderCategoryD
15        7       Yes         A  OrderCategoryB
16        8       Yes         A  OrderCategoryC
17        9        No                          
22       10       Yes         B  OrderCategoryA
23       10       Yes         B  OrderCategoryB
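
In case `stack()` does not drop the missing values by itself on your pandas version (my understanding is that newer releases change this behaviour, so treat that as an assumption), here is a self-contained sketch of the same steps with an explicit `dropna()`. The `flag` column name and the final sort are my own additions, and the rows below carry seven values each so that they line up with the seven column names.

```python
import pandas as pd

data = [[1, 'Yes', 'A', 'No', 'Yes', 'No', 'No'],
        [2, 'Yes', 'A', 'No', 'No', 'Yes', 'No'],
        [3, 'Yes', 'B', 'No', 'No', 'Yes', 'No'],
        [4, 'No', '', '', '', '', ''],
        [5, 'No', '', '', '', '', ''],
        [6, 'Yes', 'C', 'No', 'No', 'Yes', 'Yes'],
        [7, 'Yes', 'A', 'No', 'Yes', 'No', 'No'],
        [8, 'Yes', 'A', 'No', 'No', 'Yes', 'No'],
        [9, 'No', '', '', '', '', ''],
        [10, 'Yes', 'B', 'Yes', 'Yes', 'No', 'No']]
columns = ['Cust_ID', 'OrderMade', 'OrderType',
           'OrderCategoryA', 'OrderCategoryB', 'OrderCategoryC', 'OrderCategoryD']
df = pd.DataFrame(data, columns=columns)

df_1 = df.set_index(['Cust_ID', 'OrderMade', 'OrderType'])

# Keep only the 'Yes' and '' cells; every 'No' becomes NaN
flags = df_1.where((df_1 == 'Yes') | (df_1 == '')).rename_axis('OrderCategory', axis=1)

# The explicit dropna() removes the NaN cells whether or not stack() already did
df_2 = flags.stack().dropna().reset_index(name='flag')

# Customers without an order get a blank category instead of a column name
df_2['OrderCategory'] = df_2['OrderCategory'].mask(df_2['OrderMade'] == 'No', '')

result = (df_2.drop(columns='flag')
              .drop_duplicates()
              .sort_values(['Cust_ID', 'OrderCategory'])
              .reset_index(drop=True))
print(result)
```

This should give one row per customer and category, plus a single blank-category row for each customer who made no order.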

1

Here is one way to do it (I did have to modify your original dataframe so that it only had one OrderCategoryD instead of two... hopefully that was a typo):

keep_cols = ['Cust_ID', 'OrderMade', 'OrderType']
parts = []

for col in df.columns:
    if 'OrderCategory' in col:
        cat = col[-1:]                                     # Get the category letter
        temp = df.loc[df[col] == 'Yes', keep_cols].copy()  # Get all the rows with a yes in this column
        temp['OrderCategory'] = cat                        # Add a column with the correct letter
        parts.append(temp)                                 # Collect that df for the new df

# DataFrame.append was removed in pandas 2.0, so combine the pieces with concat
build = pd.concat(parts)

# Once that's done, get all the rows that have a 'No' in the OrderMade column
final = pd.merge(build, df[keep_cols], how='right').sort_values('Cust_ID')
final = final.reset_index(drop=True)
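
The loop above stores only the category letter and leaves NaN for customers without an order. If you want the full column name and blank strings, as in the desired output in the question, a variation could look like the sketch below. The names `category_cols` and `parts` are my own, and the rows carry seven values each so that they line up with the seven column names.

```python
import pandas as pd

data = [[1, 'Yes', 'A', 'No', 'Yes', 'No', 'No'],
        [2, 'Yes', 'A', 'No', 'No', 'Yes', 'No'],
        [3, 'Yes', 'B', 'No', 'No', 'Yes', 'No'],
        [4, 'No', '', '', '', '', ''],
        [5, 'No', '', '', '', '', ''],
        [6, 'Yes', 'C', 'No', 'No', 'Yes', 'Yes'],
        [7, 'Yes', 'A', 'No', 'Yes', 'No', 'No'],
        [8, 'Yes', 'A', 'No', 'No', 'Yes', 'No'],
        [9, 'No', '', '', '', '', ''],
        [10, 'Yes', 'B', 'Yes', 'Yes', 'No', 'No']]
columns = ['Cust_ID', 'OrderMade', 'OrderType',
           'OrderCategoryA', 'OrderCategoryB', 'OrderCategoryC', 'OrderCategoryD']
df = pd.DataFrame(data, columns=columns)

keep_cols = ['Cust_ID', 'OrderMade', 'OrderType']
category_cols = [c for c in df.columns if c.startswith('OrderCategory')]

# One small frame per category column, labelled with the full column name
parts = [df.loc[df[col] == 'Yes', keep_cols].assign(OrderCategory=col)
         for col in category_cols]
build = pd.concat(parts)

# The right merge brings back customers with no 'Yes' at all; blank out their NaN
final = (build.merge(df[keep_cols], how='right')
              .fillna({'OrderCategory': ''})
              .sort_values(['Cust_ID', 'OrderCategory'])
              .reset_index(drop=True))
print(final)
```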

1

Add another Category Column representing the 'No's in 'OrderMade'

This generalizes the problem and enables us to use a more uniform method.

import numpy as np

d = df.assign(**{'': df.OrderMade.map({'Yes': 'No', 'No': 'Yes'})})
ids, cat = d.iloc[:, :3], d.iloc[:, 3:]  # split between 3rd and 4th columns
i, j = np.where(cat.eq('Yes'))           # row and column positions of every 'Yes'

ids.iloc[i].assign(OrderCategory=cat.columns[j])

  Cust_ID OrderMade OrderType   OrderCategory
0       1       Yes         A  OrderCategoryB
1       2       Yes         A  OrderCategoryC
2       3       Yes         B  OrderCategoryC
3       4        No                          
4       5        No                          
5       6       Yes         C  OrderCategoryC
5       6       Yes         C  OrderCategoryD
6       7       Yes         A  OrderCategoryB
7       8       Yes         A  OrderCategoryC
8       9        No                          
9      10       Yes         B  OrderCategoryA
9      10       Yes         B  OrderCategoryB
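
If the `np.where` step looks opaque, here is a tiny sketch of what it returns. The two-row frame `toy` is made up for illustration and is not the question's data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'OrderCategoryA': ['No', 'Yes'],
                    'OrderCategoryB': ['Yes', 'Yes']})

# Row positions and column positions of every 'Yes', in row-major order
i, j = np.where(toy.eq('Yes').to_numpy())

# Pair each row position with the matching column name
pairs = list(zip(i.tolist(), toy.columns[j]))
print(pairs)
# [(0, 'OrderCategoryB'), (1, 'OrderCategoryA'), (1, 'OrderCategoryB')]
```

Row 1 shows up twice because it has two 'Yes' cells, which is exactly how customers 6 and 10 end up with two rows each.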

melt

Adding the column simplifies the melt as well

d = df.assign(**{'': df.OrderMade.map({'Yes': 'No', 'No': 'Yes'})})
d.melt(['Cust_ID', 'OrderMade', 'OrderType'], var_name='OrderCategory') \
 .query('value == "Yes"').drop(columns='value').sort_values('Cust_ID')

    Cust_ID OrderMade OrderType   OrderCategory
10        1       Yes         A  OrderCategoryB
21        2       Yes         A  OrderCategoryC
22        3       Yes         B  OrderCategoryC
53        4        No                          
54        5        No                          
25        6       Yes         C  OrderCategoryC
35        6       Yes         C  OrderCategoryD
16        7       Yes         A  OrderCategoryB
27        8       Yes         A  OrderCategoryC
58        9        No                          
9        10       Yes         B  OrderCategoryA
19       10       Yes         B  OrderCategoryB

0

As suggested in the other answer, you want melt with some extra cleaning, and then a merge:

id_cols = ['Cust_ID','OrderMade','OrderType']
new_df = df[df.OrderMade.eq('Yes')].melt(id_vars=id_cols, var_name='OrderCategory')

# Parentheses are needed so the chained calls can span several lines
(new_df[new_df['value'].ne('No')]
    .merge(df.loc[df.OrderMade.eq('No'), id_cols],
           how='outer')
    .drop('value', axis=1))

Output:

    Cust_ID OrderMade OrderType   OrderCategory
0        10       Yes         B  OrderCategoryA
1        10       Yes         B  OrderCategoryB
2         1       Yes         A  OrderCategoryB
3         7       Yes         A  OrderCategoryB
4         2       Yes         A  OrderCategoryC
5         3       Yes         B  OrderCategoryC
6         6       Yes         C  OrderCategoryC
7         6       Yes         C  OrderCategoryD
8         8       Yes         A  OrderCategoryC
9         4        No                       NaN
10        5        No                       NaN
11        9        No                       NaN
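
The output above leaves NaN for customers without an order and is not sorted by customer. If you need blanks and the original order, as in the desired output, something along these lines might work. It is a sketch: the names `melted` and `result` are mine, and the rows carry seven values each so that they line up with the seven column names.

```python
import pandas as pd

data = [[1, 'Yes', 'A', 'No', 'Yes', 'No', 'No'],
        [2, 'Yes', 'A', 'No', 'No', 'Yes', 'No'],
        [3, 'Yes', 'B', 'No', 'No', 'Yes', 'No'],
        [4, 'No', '', '', '', '', ''],
        [5, 'No', '', '', '', '', ''],
        [6, 'Yes', 'C', 'No', 'No', 'Yes', 'Yes'],
        [7, 'Yes', 'A', 'No', 'Yes', 'No', 'No'],
        [8, 'Yes', 'A', 'No', 'No', 'Yes', 'No'],
        [9, 'No', '', '', '', '', ''],
        [10, 'Yes', 'B', 'Yes', 'Yes', 'No', 'No']]
columns = ['Cust_ID', 'OrderMade', 'OrderType',
           'OrderCategoryA', 'OrderCategoryB', 'OrderCategoryC', 'OrderCategoryD']
df = pd.DataFrame(data, columns=columns)

id_cols = ['Cust_ID', 'OrderMade', 'OrderType']

# Long format for customers who ordered: one row per customer and category column
melted = df[df.OrderMade.eq('Yes')].melt(id_vars=id_cols, var_name='OrderCategory')

result = (melted[melted['value'].eq('Yes')]
          .drop(columns='value')
          # Bring back the customers who made no order
          .merge(df.loc[df.OrderMade.eq('No'), id_cols], how='outer')
          # Their category is missing after the merge; show it as blank
          .fillna({'OrderCategory': ''})
          .sort_values(['Cust_ID', 'OrderCategory'])
          .reset_index(drop=True))
print(result)
```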
