Pandas DataFrame - Adding rows to df based on data in df

Question

Apologies for the vague title. I have been trying, unsuccessfully so far, to come up with a way to add new rows to a pandas DataFrame based on the contents of some of its columns. I hope an example makes it clear. The data is mock-up data, which hopefully suffices to paint the bigger picture.

So, let's say a car dealer has, among others, the following 7 customers. The dataframe shows their customer id, their gender (because why not), and the country they currently live in. It also shows, for each of four car brands, which model they have bought, or 'NA' if they have not bought one. All values in the dataframe are strings, by the way. For example, customer 4 is a female from Russia, and she has bought a Porsche 911 from the dealer.

        Cust-id Sex Country Audi Ferrari Porsche Jaguar
    0   Cu1      F    FR     R8    FF      NA     NA
    1   Cu2      M    US     NA    NA      NA     XF
    2   Cu3      M    UK     RS7   NA      NA     NA
    3   Cu4      F    RU     NA    NA      911    NA
    4   Cu5      M    US     NA    NA      918    Ford
    5   Cu6      F    US     S6    NA      NA     F-type
    6   Cu7      M    UK     A8    NA      MacanS XE

What I'd like to be able to do is create new rows for those cases where a customer has bought more than one car, with each row specifying only one car and all the other car brand columns saying 'NA' in that row. For the above example this would result in the following dataframe.

            Cust-id Sex Country Audi Ferrari Porsche Jaguar
    0         Cu1    F    FR     R8    NA      NA     NA
    1         Cu1    F    FR     NA    FF      NA     NA
    2         Cu2    M    US     NA    NA      NA     XF
    3         Cu3    M    UK     RS7   NA      NA     NA
    4         Cu4    F    RU     NA    NA      911    NA
    5         Cu5    M    US     NA    NA      918    NA
    6         Cu5    M    US     NA    NA      NA     Ford
    7         Cu6    F    US     S6    NA      NA     NA
    8         Cu6    F    US     NA    NA      NA     F-type
    9         Cu7    M    UK     A8    NA      NA     NA
    10        Cu7    M    UK     NA    NA      MacanS NA
    11        Cu7    M    UK     NA    NA      NA     XE

This means that an original row with three cars specified would lead to three new rows, each specifying only one of the cars (with the original row gone). The Cust-id, Sex, and Country values do not change. This is my first time asking a question on the website, so hopefully the formatting is not too bad. I appreciate any help or guidance.

Tags: python, pandas, dataframe

Sorry, what was wrong with my answer or your question? stackoverflow.com/questions/38523891/… ?
– jezrael, Commented Jul 22, 2016 at 10:54

elelias · Accepted Answer · 2016-07-19 13:26:21Z

1

The way I would approach this is the following:

Iterate over every car column and keep only the rows where that car was actually bought:

df_dict = {}
for car in ['Audi', 'Ferrari', 'Porsche', 'Jaguar']:
    # The question stores missing cars as the string 'NA', so filter on
    # that string as well as on real nulls.
    has_car = df[car].notna() & (df[car] != 'NA')
    df_dict[car] = df.loc[has_car, ['Cust-id', 'Sex', 'Country', car]]

Concatenate the dataframes with pd.concat. Each piece only has its own car column, so the other car columns are filled with nulls in the right places:
```
final_df = pd.concat(df_dict.values())
```

Something along those lines should work. Did not test my code though, so use your own judgement!
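
For reference, here is a self-contained sketch of this approach run against the question's data. It assumes missing cars are the literal string 'NA', as the question states. The names id_cols, car_cols and pieces are only for illustration, and the last step (refilling 'NA', sorting, renumbering) is an addition to get back to the layout the question asks for.

```python
import pandas as pd

# Mock-up data from the question; 'NA' is a literal string, not a real null.
df = pd.DataFrame({
    'Cust-id': ['Cu1', 'Cu2', 'Cu3', 'Cu4', 'Cu5', 'Cu6', 'Cu7'],
    'Sex': ['F', 'M', 'M', 'F', 'M', 'F', 'M'],
    'Country': ['FR', 'US', 'UK', 'RU', 'US', 'US', 'UK'],
    'Audi': ['R8', 'NA', 'RS7', 'NA', 'NA', 'S6', 'A8'],
    'Ferrari': ['FF', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA'],
    'Porsche': ['NA', 'NA', 'NA', '911', '918', 'NA', 'MacanS'],
    'Jaguar': ['NA', 'XF', 'NA', 'NA', 'Ford', 'F-type', 'XE'],
})

id_cols = ['Cust-id', 'Sex', 'Country']
car_cols = ['Audi', 'Ferrari', 'Porsche', 'Jaguar']

# One sub-frame per brand, holding only the customers who bought that brand.
pieces = [df.loc[df[car] != 'NA', id_cols + [car]] for car in car_cols]

# concat aligns on column names, so brands absent from a piece become NaN.
final_df = pd.concat(pieces)

# Restore the 'NA' marker, group each customer's rows together
# (mergesort is stable, so brand order is kept within a customer),
# and renumber the rows.
final_df = (final_df[id_cols + car_cols]
            .fillna('NA')
            .sort_index(kind='mergesort')
            .reset_index(drop=True))

print(final_df)
```

Sorting on the original row labels is what puts a customer's rows next to each other; without it the rows come out grouped by brand.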

edited Jul 19, 2016 at 13:26

answered Jul 19, 2016 at 12:55


2 Comments

J_Dav Over a year ago

Thanks for your answer elelias, much appreciated. After executing the code I'm getting a TypeError ("cannot concatenate a non-NDFrame object"). Any idea how to fix this?

elelias Over a year ago

Sorry, change the items() to values(). I'll edit the answer.
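
For anyone hitting the same TypeError, a small sketch of why items() fails here: it yields (key, DataFrame) tuples, while pd.concat expects the DataFrames themselves. The frames dict and its contents are made up for illustration.

```python
import pandas as pd

# Hypothetical two-entry dict, shaped like df_dict in the answer above.
frames = {
    'Audi': pd.DataFrame({'Cust-id': ['Cu1'], 'Audi': ['R8']}),
    'Jaguar': pd.DataFrame({'Cust-id': ['Cu2'], 'Jaguar': ['XF']}),
}

# items() yields (key, DataFrame) tuples, which concat rejects.
try:
    pd.concat(frames.items())
    items_error = None
except TypeError as exc:
    items_error = exc

# values() yields the DataFrames themselves, so this works.
combined = pd.concat(list(frames.values()))

print(type(items_error).__name__)
print(combined.shape)
```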

unutbu · Accepted Answer · 2016-07-19 13:50:40Z

import pandas as pd

df = pd.DataFrame({'Audi': ['R8', 'NA', 'RS7', 'NA', 'NA', 'S6', 'A8'],
 'Country': ['FR', 'US', 'UK', 'RU', 'US', 'US', 'UK'],
 'Cust-id': ['Cu1', 'Cu2', 'Cu3', 'Cu4', 'Cu5', 'Cu6', 'Cu7'],
 'Ferrari': ['FF', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA'],
 'Jaguar': ['NA', 'XF', 'NA', 'NA', 'Ford', 'F-type', 'XE'],
 'Porsche': ['NA', 'NA', 'NA', '911', '918', 'NA', 'MacanS'],
 'Sex': ['F', 'M', 'M', 'F', 'M', 'F', 'M']})

result = pd.melt(df, id_vars=['Cust-id', 'Sex', 'Country'])
mask = result['value'] != 'NA'
result = result.loc[mask].copy()  # copy to avoid SettingWithCopyWarning when adding a column below
result['index'] = result.index
result = pd.concat([result[['Cust-id', 'Sex', 'Country']], 
           result.pivot(index='index', columns='variable', values='value')], axis=1)

print(result)

yields

   Cust-id Sex Country  Audi Ferrari  Jaguar Porsche
0      Cu1   F      FR    R8    None    None    None
2      Cu3   M      UK   RS7    None    None    None
5      Cu6   F      US    S6    None    None    None
6      Cu7   M      UK    A8    None    None    None
7      Cu1   F      FR  None      FF    None    None
15     Cu2   M      US  None    None      XF    None
18     Cu5   M      US  None    None    Ford    None
19     Cu6   F      US  None    None  F-type    None
20     Cu7   M      UK  None    None      XE    None
24     Cu4   F      RU  None    None    None     911
25     Cu5   M      US  None    None    None     918
27     Cu7   M      UK  None    None    None  MacanS

You could use melt to coalesce the car columns into a single column:

In [232]: result = pd.melt(df, id_vars=['Cust-id', 'Sex', 'Country']); result.head()
Out[232]: 
  Cust-id Sex Country variable value
0     Cu1   F      FR     Audi    R8
1     Cu2   M      US     Audi    NA
2     Cu3   M      UK     Audi   RS7
3     Cu4   F      RU     Audi    NA
4     Cu5   M      US     Audi    NA
...

Remove the rows with 'NA' string values:

mask = result['value'] != 'NA'
result = result.loc[mask].copy()

and then use pivot to reshape the result. pivot is roughly the inverse of pd.melt -- it spreads the values from one column (e.g. 'variable') across many columns, thus un-coalescing the car columns.

result['index'] = result.index
result = pd.concat([result[['Cust-id', 'Sex', 'Country']], 
           result.pivot(index='index', columns='variable', values='value')], axis=1)

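result['index'] = result.index copies the row labels into a column so that pivot has a unique key per row. This makes the pivot preserve the rows as-is, one car per row, instead of merging a customer's cars back into a single row.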
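
To illustrate why the per-row key matters, here is a small sketch on a made-up two-customer frame (the frame and the names long, by_customer and by_row are assumptions for the example). Pivoting on Cust-id collapses a customer's cars back into one row, while pivoting on the copied index keeps one car per row.

```python
import pandas as pd

# Hypothetical cut-down version of the question's data.
df = pd.DataFrame({
    'Cust-id': ['Cu1', 'Cu2'],
    'Audi': ['R8', 'NA'],
    'Ferrari': ['FF', 'NA'],
    'Jaguar': ['NA', 'XF'],
})

long = pd.melt(df, id_vars=['Cust-id'])
long = long[long['value'] != 'NA'].copy()

# Keyed on the customer: Cu1's Audi and Ferrari land in the same row again.
by_customer = long.pivot(index='Cust-id', columns='variable', values='value')

# Keyed on the melted row label: every purchase stays in its own row.
long['index'] = long.index
by_row = long.pivot(index='index', columns='variable', values='value')

print(by_customer)
print(by_row)
```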
