Apologies for the vague title. I've been trying, unsuccessfully so far, to come up with a way to add new 'rows' to a pandas dataframe based on the contents of some of its columns. I hope an example makes it clear. The data is mock-up data, which hopefully suffices to paint the bigger picture.

So, let's say a car dealer has, among others, the following 7 customers. In the dataframe you can see their customer id, their gender (because why not), and the country they currently live in. In addition, you can see whether they've bought a car from any of four brands (and which model) or not ('NA'). All values in the dataframe are strings, including 'NA'. For example, customer 4 is a female from Russia, and she has bought a Porsche 911 from the dealer.

        Cust-id Sex Country Audi Ferrari Porsche Jaguar
    0   Cu1      F    FR     R8    FF      NA     NA
    1   Cu2      M    US     NA    NA      NA     XF
    2   Cu3      M    UK     RS7   NA      NA     NA
    3   Cu4      F    RU     NA    NA      911    NA
    4   Cu5      M    US     NA    NA      918    Ford
    5   Cu6      F    US     S6    NA      NA     F-type
    6   Cu7      M    UK     A8    NA      MacanS XE

What I'd like to be able to do is create new rows for those cases where a customer has bought more than one car, with each row specifying only one car and all the other car brand columns saying 'NA' in that row. For the above example this would result in the following dataframe.

            Cust-id Sex Country Audi Ferrari Porsche Jaguar
    0         Cu1    F    FR     R8    NA      NA     NA
    1         Cu1    F    FR     NA    FF      NA     NA
    2         Cu2    M    US     NA    NA      NA     XF
    3         Cu3    M    UK     RS7   NA      NA     NA
    4         Cu4    F    RU     NA    NA      911    NA
    5         Cu5    M    US     NA    NA      918    NA
    6         Cu5    M    US     NA    NA      NA     Ford
    7         Cu6    F    US     S6    NA      NA     NA
    8         Cu6    F    US     NA    NA      NA     F-type
    9         Cu7    M    UK     A8    NA      NA     NA
    10        Cu7    M    UK     NA    NA      MacanS NA
    11        Cu7    M    UK     NA    NA      NA     XE

This means that an original row with three cars specified would lead to three new rows, each specifying only one of the cars (with the original row gone). The Cust-id, Sex, and Country values do not change. This is my first time asking a question on the website, so hopefully the formatting is not too bad. I appreciate any help or guidance.

2 Answers

The way I would approach this is the following:

  1. Iterate over every car column and keep only the records that have non-null values

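    import pandas as pd

    df_dict = {}

    for car in ['Audi', 'Ferrari', 'Porsche', 'Jaguar']:

        # The question stores "not bought" as the string 'NA', which
        # pd.isnull does not treat as null, so test for both.
        non_nulls = df[df[car].notnull() & (df[car] != 'NA')]

        df_dict[car] = non_nulls[['Cust-id', 'Sex', 'Country', car]]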
    
  2. Concatenate the dataframes with pd.concat; this will create the nulls in the right places (as NaN, so call .fillna('NA') afterwards if you want the string back)

    final_df = pd.concat( df_dict.values() )
    

Something along those lines should work. Did not test my code though, so use your own judgement!
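
For reference, here is a self-contained sketch of the two steps above, run on the sample data from the question. It assumes missing entries are the literal string 'NA' (as the question states). The helper name split_by_brand, the final sort, and the fillna call are illustrative additions rather than part of the approach as originally posted.

```python
import pandas as pd

ID_COLS = ['Cust-id', 'Sex', 'Country']
CAR_COLS = ['Audi', 'Ferrari', 'Porsche', 'Jaguar']


def split_by_brand(df, id_cols=ID_COLS, car_cols=CAR_COLS, missing='NA'):
    """Return one row per (customer, purchased car). Hypothetical helper."""
    pieces = []
    for car in car_cols:
        # Step 1: keep only the customers who bought this brand.
        pieces.append(df.loc[df[car] != missing, id_cols + [car]])
    # Step 2: concat aligns on column names, so the brand columns a piece
    # lacks come out as NaN; fill them with the same 'NA' marker.
    out = pd.concat(pieces, sort=False).fillna(missing)
    # Sorting by the original row label groups each customer's rows together;
    # mergesort is stable, so the brand order within a customer is kept.
    out = out.sort_index(kind='mergesort').reset_index(drop=True)
    return out[id_cols + car_cols]


df = pd.DataFrame({
    'Cust-id': ['Cu1', 'Cu2', 'Cu3', 'Cu4', 'Cu5', 'Cu6', 'Cu7'],
    'Sex': ['F', 'M', 'M', 'F', 'M', 'F', 'M'],
    'Country': ['FR', 'US', 'UK', 'RU', 'US', 'US', 'UK'],
    'Audi': ['R8', 'NA', 'RS7', 'NA', 'NA', 'S6', 'A8'],
    'Ferrari': ['FF', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA'],
    'Porsche': ['NA', 'NA', 'NA', '911', '918', 'NA', 'MacanS'],
    'Jaguar': ['NA', 'XF', 'NA', 'NA', 'Ford', 'F-type', 'XE'],
})

result = split_by_brand(df)
print(result)
```

With the sample data this should give 12 rows, one per purchased car, with the rows of each customer next to each other.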


2 Comments

Thanks for your answer elelias, much appreciated. After executing the code I'm getting a TypeError ("cannot concatenate a non-NDFrame object"). Any idea how to fix this?
Sorry, change the items() to values(); I'll edit the answer.
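
The difference the comments refer to can be reproduced with a tiny sketch. The two one-row frames below are made up for illustration, and the exact wording of the error message depends on the pandas version.

```python
import pandas as pd

frames = {
    'Audi': pd.DataFrame({'Cust-id': ['Cu1'], 'Audi': ['R8']}),
    'Ferrari': pd.DataFrame({'Cust-id': ['Cu1'], 'Ferrari': ['FF']}),
}

# items() yields (key, DataFrame) tuples, which pd.concat cannot combine.
try:
    pd.concat(frames.items())
    items_error = None
except TypeError as exc:
    items_error = exc

# values() yields only the DataFrames, which is what pd.concat expects.
combined = pd.concat(list(frames.values()), ignore_index=True)

print(type(items_error).__name__)
print(combined)
```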
    import pandas as pd

    df = pd.DataFrame({'Audi': ['R8', 'NA', 'RS7', 'NA', 'NA', 'S6', 'A8'],
     'Country': ['FR', 'US', 'UK', 'RU', 'US', 'US', 'UK'],
     'Cust-id': ['Cu1', 'Cu2', 'Cu3', 'Cu4', 'Cu5', 'Cu6', 'Cu7'],
     'Ferrari': ['FF', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA'],
     'Jaguar': ['NA', 'XF', 'NA', 'NA', 'Ford', 'F-type', 'XE'],
     'Porsche': ['NA', 'NA', 'NA', '911', '918', 'NA', 'MacanS'],
     'Sex': ['F', 'M', 'M', 'F', 'M', 'F', 'M']})

    result = pd.melt(df, id_vars=['Cust-id', 'Sex', 'Country'])
    mask = result['value'] != 'NA'
    result = result.loc[mask]
    result['index'] = result.index
    result = pd.concat([result[['Cust-id', 'Sex', 'Country']],
               result.pivot(index='index', columns='variable', values='value')], axis=1)

    print(result)

yields

       Cust-id Sex Country  Audi Ferrari  Jaguar Porsche
    0      Cu1   F      FR    R8    None    None    None
    2      Cu3   M      UK   RS7    None    None    None
    5      Cu6   F      US    S6    None    None    None
    6      Cu7   M      UK    A8    None    None    None
    7      Cu1   F      FR  None      FF    None    None
    15     Cu2   M      US  None    None      XF    None
    18     Cu5   M      US  None    None    Ford    None
    19     Cu6   F      US  None    None  F-type    None
    20     Cu7   M      UK  None    None      XE    None
    24     Cu4   F      RU  None    None    None     911
    25     Cu5   M      US  None    None    None     918
    27     Cu7   M      UK  None    None    None  MacanS

You could use melt to coalesce the car columns into a single column:

    In [232]: result = pd.melt(df, id_vars=['Cust-id', 'Sex', 'Country']); result.head()
    Out[232]: 
      Cust-id Sex Country variable value
    0     Cu1   F      FR     Audi    R8
    1     Cu2   M      US     Audi    NA
    2     Cu3   M      UK     Audi   RS7
    3     Cu4   F      RU     Audi    NA
    4     Cu5   M      US     Audi    NA
    ...

Remove the rows with 'NA' string values:

    mask = result['value'] != 'NA'
    result = result.loc[mask]

and then use pivot to reshape the result. pivot is roughly the inverse of pd.melt -- it spreads the values from one column (e.g. 'variable') across many columns, thus un-coalescing the car columns.

    result['index'] = result.index
    result = pd.concat([result[['Cust-id', 'Sex', 'Country']],
               result.pivot(index='index', columns='variable', values='value')], axis=1)

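result['index'] = result.index copies the (unique) row labels into a column so they can serve as the pivot index. Each melted row then maps to its own output row, so pivot preserves the rows as-is instead of trying to combine the purchases of one customer.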
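
The pivoted result is grouped by brand, shows empty cells as None/NaN, and lists the car columns alphabetically. If the layout from the question is wanted, one possible finishing step is sketched below on a cut-down version of the data. The names long_df, wide, and tidy are illustrative, and sorting on 'Cust-id' as a string is an assumption that only holds while the ids sort correctly as text.

```python
import pandas as pd

# Cut-down version of the question's data (assumption: the string 'NA'
# marks cars that were not bought).
df = pd.DataFrame({
    'Cust-id': ['Cu1', 'Cu2', 'Cu5'],
    'Sex': ['F', 'M', 'M'],
    'Country': ['FR', 'US', 'US'],
    'Audi': ['R8', 'NA', 'NA'],
    'Ferrari': ['FF', 'NA', 'NA'],
    'Porsche': ['NA', 'NA', '918'],
    'Jaguar': ['NA', 'XF', 'Ford'],
})
id_cols = ['Cust-id', 'Sex', 'Country']
car_cols = ['Audi', 'Ferrari', 'Porsche', 'Jaguar']

# Same melt / filter / pivot steps as above.
long_df = pd.melt(df, id_vars=id_cols)
long_df = long_df.loc[long_df['value'] != 'NA'].copy()
long_df['index'] = long_df.index
wide = pd.concat(
    [long_df[id_cols],
     long_df.pivot(index='index', columns='variable', values='value')],
    axis=1,
)

# Finishing touches: original column order, 'NA' markers instead of NaN,
# and each customer's rows next to each other (mergesort is stable).
tidy = (
    wide.reindex(columns=id_cols + car_cols)
    .fillna('NA')
    .sort_values('Cust-id', kind='mergesort')
    .reset_index(drop=True)
)
print(tidy)
```

The .copy() after filtering is there only to avoid a SettingWithCopyWarning when the 'index' column is added; it does not change the result.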
