
I've done this with iterrows(), but I'm hoping there is a faster and more elegant way to achieve the desired outcome.
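For reference, the row-by-row version I'm trying to replace looks roughly like this (a minimal sketch using made-up sample data standing in for orders.csv; only the first two orders from the example below):

```python
import numpy as np
import pandas as pd

# Sample data standing in for orders.csv (first two orders from the example).
df_orders = pd.DataFrame({
    'OrderNo':  [20043, 20042],
    'CustName': ['Sanjay Singh', 'William Sonoma'],
    'Product1': [131, 420],
    'Product2': [320, 420],
    'Product3': [320, 131],
    'Product4': [131, 320],
    'Product5': [np.nan, 511],
})

product_cols = [c for c in df_orders.columns if c.startswith('Product')]

# Slow baseline: visit every row, then every product column, skipping NaNs.
codes = []
for _, row in df_orders.iterrows():
    for col in product_cols:
        if pd.notnull(row[col]):
            codes.append(int(row[col]))

df_product_list = pd.DataFrame({'ProdCode': codes})
```

This works, but iterrows() builds a Series object for every row, which gets slow on large frames.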

Problem Statement:

I have several rows of NaN and non-null values across a subset of columns (Product1, Product2, ...) in a dataframe (df_orders). I want to take every non-null value in this subset and stack them into a single new column, reading each row left to right, from the first row to the last.

Example: Create a single column containing all the products ordered.

>>> df_orders = pd.read_csv('orders.csv')

>>> df_orders 

   OrderNo              CustName  Product1  Product2  Product3  Product4  Product5
0    20043          Sanjay Singh       131       320       320       131       nan
1    20042        William Sonoma       420       420       131       320       511
2    20041          Maria Alonso       320       420       320       nan       nan
3    20040              Jim Beam       511       131       nan       nan       nan
4    20039          Gunter Grass       320       131       131       131       nan
5    20038         Billy Joe Bob       420       511       511       nan       nan
6    20037  Cynthia Silvia Stout        55        12       131        55        12
7    20036         Alan Ginsburg       131       320       320        12       nan
8    20035       Ronald McDonald       131       131       511       nan       nan

The result I'm looking for:

Create a new dataframe called df_product_list. Starting with the first row in df_orders, create a new row in df_product_list for each non-null product column value.

Because the order from Sanjay Singh is first and has four non-null values in the product columns, the first four rows of the df_product_list will be 131, 320, 320, and 131.

>>> df_product_list
ProdCode
0    131
1    320
2    320
3    131
4    420
5    420
6    131
7    320
8    511
9    320
10   420
11   320
12   511
13   131
14   320
15   131
16   131
17   131
...

1 Answer

Let's try filter and stack:

pd.Series(df_orders.filter(like='Product').stack().values, name='product_list')

0     131.0
1     320.0
2     320.0
3     131.0
4     420.0
5     420.0
...
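If you want the exact ProdCode column from the question (integer codes, a fresh default index), one way is to let stack() drop the NaNs and then cast. A sketch, assuming the same df_orders frame as above (rebuilt here with the first two orders so the snippet is self-contained):

```python
import numpy as np
import pandas as pd

# First two orders from the example, product columns only plus OrderNo.
df_orders = pd.DataFrame({
    'OrderNo':  [20043, 20042],
    'Product1': [131.0, 420.0],
    'Product2': [320.0, 420.0],
    'Product3': [320.0, 131.0],
    'Product4': [131.0, 320.0],
    'Product5': [np.nan, 511.0],
})

# stack() drops NaNs by default and walks each row left to right,
# which is exactly the ordering the question asks for.
df_product_list = (df_orders.filter(like='Product')
                            .stack()
                            .astype(int)
                            .reset_index(drop=True)
                            .rename('ProdCode')
                            .to_frame())
```

The astype(int) is safe here because stack() has already removed the NaNs that forced the columns to float in the first place.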

For performance, you may want to operate in numpy space and drop the NaNs yourself with np.isnan (DataFrame.stack does this for you, but at a much higher cost than desired).

arr = df_orders.filter(like='Product').values.ravel()
pd.Series(arr[~np.isnan(arr)].astype(int), name='product_list')
0     131
1     320
2     320
3     131
4     420
5     420
...

1 Comment

I couldn't get the np.isnan to work, but the first option worked very well. Thanks for helping me out!
