
I've done this with iterrows(), but I'm hoping there is a faster and more elegant way to achieve the desired outcome.
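For reference, the row-by-row version I'm trying to replace looks roughly like this (a minimal sketch using made-up sample data standing in for orders.csv; only the first two orders from the example below):

```python
import numpy as np
import pandas as pd

# Sample data standing in for orders.csv (first two orders from the example).
df_orders = pd.DataFrame({
    'OrderNo':  [20043, 20042],
    'CustName': ['Sanjay Singh', 'William Sonoma'],
    'Product1': [131, 420],
    'Product2': [320, 420],
    'Product3': [320, 131],
    'Product4': [131, 320],
    'Product5': [np.nan, 511],
})

product_cols = [c for c in df_orders.columns if c.startswith('Product')]

# Slow baseline: visit every row, then every product column, skipping NaNs.
codes = []
for _, row in df_orders.iterrows():
    for col in product_cols:
        if pd.notnull(row[col]):
            codes.append(int(row[col]))

df_product_list = pd.DataFrame({'ProdCode': codes})
```

This works, but iterrows() builds a Series object for every row, which gets slow on large frames.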

Problem Statement:

I have several rows of NaN and non-null values across a subset of columns (Product1, Product2, ...) in a dataframe (df_orders). I want to take every non-null value in this subset and stack them into a single new column, reading each row left to right, from the first row to the last.

Example: Create a single column containing all the products ordered.

>>> df_orders = pd.read_csv('orders.csv')

>>> df_orders 

   OrderNo              CustName  Product1  Product2  Product3  Product4  Product5
0    20043          Sanjay Singh       131       320       320       131       nan
1    20042        William Sonoma       420       420       131       320       511
2    20041          Maria Alonso       320       420       320       nan       nan
3    20040              Jim Beam       511       131       nan       nan       nan
4    20039          Gunter Grass       320       131       131       131       nan
5    20038         Billy Joe Bob       420       511       511       nan       nan
6    20037  Cynthia Silvia Stout        55        12       131        55        12
7    20036         Alan Ginsburg       131       320       320        12       nan
8    20035       Ronald McDonald       131       131       511       nan       nan

The result I'm looking for:

Create a new dataframe called df_product_list. Starting with the first row in df_orders, create a new row in df_product_list for each non-null product column value.

Because the order from Sanjay Singh is first and has four non-null values in the product columns, the first four rows of the df_product_list will be 131, 320, 320, and 131.

>>> df_product_list
ProdCode
0    131
1    320
2    320
3    131
4    420
5    420
6    131
7    320
8    511
9    320
10   420
11   320
12   511
13   131
14   320
15   131
16   131
17   131
...

1 Answer

Let's try filter and stack:

pd.Series(df_orders.filter(like='Product').stack().values, name='product_list')

0     131.0
1     320.0
2     320.0
3     131.0
4     420.0
5     420.0
...
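If you want the exact ProdCode column from the question (integer codes, a fresh default index), one way is to let stack() drop the NaNs and then cast. A sketch, assuming the same df_orders frame as above (rebuilt here with the first two orders so the snippet is self-contained):

```python
import numpy as np
import pandas as pd

# First two orders from the example, product columns only plus OrderNo.
df_orders = pd.DataFrame({
    'OrderNo':  [20043, 20042],
    'Product1': [131.0, 420.0],
    'Product2': [320.0, 420.0],
    'Product3': [320.0, 131.0],
    'Product4': [131.0, 320.0],
    'Product5': [np.nan, 511.0],
})

# stack() drops NaNs by default and walks each row left to right,
# which is exactly the ordering the question asks for.
df_product_list = (df_orders.filter(like='Product')
                            .stack()
                            .astype(int)
                            .reset_index(drop=True)
                            .rename('ProdCode')
                            .to_frame())
```

The astype(int) is safe here because stack() has already removed the NaNs that forced the columns to float in the first place.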

For performance, you may want to operate in numpy space and drop the NaNs yourself with np.isnan (DataFrame.stack does this for you, but at a much higher cost than desired).

arr = df_orders.filter(like='Product').values.ravel()
pd.Series(arr[~np.isnan(arr)].astype(int), name='product_list')
0     131
1     320
2     320
3     131
4     420
5     420
...

1 Comment

I couldn't get the np.isnan to work, but the first option worked very well. Thanks for helping me out!
