
I have a very unstructured data frame, listed below. The goal is to combine it into a 5-row dataframe by concatenating the strings in item01 for lines 0-3, 4-8, 9-10, 11-15, and 16 (the code is the same within each line set, but codes are not unique across sets). I was able to get the starting indices (0, 4, 9, 11, 16; the row before each starting row has NaN in the 'code' column) without using a for loop, but I can't think of a way to combine these lines without one. Can someone help? Thank you!

     code    item01  item02  item03  item04  item05
    0   1111    'a' 123 234 345 440
    1   1111    'b' nan nan nan nan
    2   nan     'c' nan nan nan nan
    3   nan     'd' nan nan nan nan
    4   2222    'b' 123 234 345 456
    5   2222    'b' nan nan nan nan
    6   nan     'c' nan nan nan nan
    7   nan     'd' nan nan nan nan
    8   nan     'e' nan nan nan nan
    9   3333    'd' 123 234 345 456
    10  nan     'b' nan nan nan nan
    11  1111    'c' 123 234 345 456
    12  1111    'b' nan nan nan nan
    13  nan     'c' nan nan nan nan
    14  nan     'd' nan nan nan nan
    15  nan     'e' nan nan nan nan
    16  5555    'a' nan nan nan nan

Expected Results:

     code    item01  item02  item03  item04  item05
    0   1111    'abcd'  123 234 345 440
    1   2222    'bbcde' 123 234 345 456
    2   3333    'db'    123 234 345 456
    3   1111    'cbcde' 123 234 345 456
    4   5555    'a'     123 234 345 456
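(For reference, the loop-free start-index computation mentioned above can be sketched like this, assuming the 'code' column shown:)

```python
import numpy as np
import pandas as pd
nan = np.nan

codes = [1111, 1111, nan, nan, 2222, 2222, nan, nan, nan,
         3333, nan, 1111, 1111, nan, nan, nan, 5555]
df = pd.DataFrame({'code': codes})

# A row starts a new group when its 'code' is not NaN
# but the previous row's 'code' is NaN (or there is no previous row)
starts = df.index[df['code'].notna() & df['code'].shift(1).isna()]
print(list(starts))  # [0, 4, 9, 11, 16]
```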
  • Show the expected result. Commented Jul 19, 2019 at 16:37
  • "combine" is too vague. What do you want to do? Sum? Mean value? There are a lot of possibilities. Commented Jul 19, 2019 at 16:38
  • Just edit the question. Thank you! Commented Jul 19, 2019 at 16:45
  • groupby with agg Commented Jul 19, 2019 at 16:45
  • I don't think groupby works because the code is not unique. Commented Jul 19, 2019 at 16:47

3 Answers


If you define

code_notnull = pd.notnull(df['code'])    

Then you can identify the start of each new group using

# True when the row is not null, but the prior row is null
mask = code_notnull & ~(code_notnull.shift(1, fill_value=False))
0      True
1     False
2     False
3     False
4      True
...

You can then define group numbers using

group_num = mask.cumsum()
0     1
1     1
2     1
3     1
4     2
...

and then group by group_num:

import numpy as np
import pandas as pd
nan = np.nan

df = pd.DataFrame({'code': [1111.0, 1111.0, nan, nan, 2222.0, 2222.0, nan, nan, nan, 3333.0, nan,
    1111.0, 1111.0, nan, nan, nan, 5555.0], 'item01': ['a', 'b', 'c', 'd',
    'b', 'b', 'c', 'd', 'e', 'd', 'b', 'c', 'b', 'c', 'd',
    'e', 'a'], 'item02': [123.0, nan, nan, nan, 123.0, nan, nan, nan, nan,
    123.0, nan, 123.0, nan, nan, nan, nan, nan], 'item03': [234.0, nan, nan, nan,
    234.0, nan, nan, nan, nan, 234.0, nan, 234.0, nan, nan, nan, nan, nan],
    'item04': [345.0, nan, nan, nan, 345.0, nan, nan, nan, nan, 345.0, nan, 345.0,
    nan, nan, nan, nan, nan], 'item05': [440.0, nan, nan, nan, 456.0, nan, nan,
    nan, nan, 456.0, nan, 456.0, nan, nan, nan, nan, nan]})

code_notnull = pd.notnull(df['code'])
mask = code_notnull & ~(code_notnull.shift(1, fill_value=False))
group_num = mask.cumsum()

# Forward-fill all NaNs. 
df = df.ffill()
grouped = df.groupby(group_num)
result = grouped.first()
result['item01'] = grouped['item01'].sum()
print(result)

yields

        code item01  item02  item03  item04  item05
code                                               
1     1111.0   abcd   123.0   234.0   345.0   440.0
2     2222.0  bbcde   123.0   234.0   345.0   456.0
3     3333.0     db   123.0   234.0   345.0   456.0
4     1111.0  cbcde   123.0   234.0   345.0   456.0
5     5555.0      a   123.0   234.0   345.0   456.0

Note that above I assumed your strings in item01 do not begin and end with single quotation marks. If they do, you could remove them with

df['item01'] = df['item01'].str[1:-1]

and then proceed as above.

import numpy as np
import pandas as pd
nan = np.nan

df = pd.DataFrame({'code': [1111.0, 1111.0, nan, nan, 2222.0, 2222.0, nan, nan, nan, 3333.0, nan,
    1111.0, 1111.0, nan, nan, nan, 5555.0], 'item01': ["'a'", "'b'", "'c'", "'d'",
    "'b'", "'b'", "'c'", "'d'", "'e'", "'d'", "'b'", "'c'", "'b'", "'c'", "'d'",
    "'e'", "'a'"], 'item02': [123.0, nan, nan, nan, 123.0, nan, nan, nan, nan,
    123.0, nan, 123.0, nan, nan, nan, nan, nan], 'item03': [234.0, nan, nan, nan,
    234.0, nan, nan, nan, nan, 234.0, nan, 234.0, nan, nan, nan, nan, nan],
    'item04': [345.0, nan, nan, nan, 345.0, nan, nan, nan, nan, 345.0, nan, 345.0,
    nan, nan, nan, nan, nan], 'item05': [440.0, nan, nan, nan, 456.0, nan, nan,
    nan, nan, 456.0, nan, 456.0, nan, nan, nan, nan, nan]})
df['item01'] = df['item01'].str[1:-1]
print(df)

yields (single quotes in df['item01'] have been removed)

      code item01  item02  item03  item04  item05
0   1111.0      a   123.0   234.0   345.0   440.0
1   1111.0      b     NaN     NaN     NaN     NaN
2      NaN      c     NaN     NaN     NaN     NaN
3      NaN      d     NaN     NaN     NaN     NaN
...

If you want to add single quotes back to the final result, you could use:

result['item01'] = "'" + result['item01'] + "'"
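As a side note, the mask-plus-cumsum steps above can be condensed into a single agg call; a minimal sketch on a trimmed-down frame (first two item columns only, strings assumed unquoted):

```python
import numpy as np
import pandas as pd
nan = np.nan

df = pd.DataFrame({'code': [1111.0, 1111.0, nan, nan, 2222.0],
                   'item01': ['a', 'b', 'c', 'd', 'b'],
                   'item02': [123.0, nan, nan, nan, 123.0]})

code_notnull = df['code'].notna()
group_num = (code_notnull & ~code_notnull.shift(1, fill_value=False)).cumsum()

# ''.join concatenates each group's strings; 'first' keeps the top row's values
result = (df.ffill()
            .groupby(group_num)
            .agg({'code': 'first', 'item01': ''.join, 'item02': 'first'})
            .reset_index(drop=True))
print(result)
```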



You can do it with groupby after you create a valid grouping column with a unique code per group.

If all the rows of each group are contiguous and the logic to identify a new group is:

The previous row of a starting row has column 'code' with value 'nan'

you simply need to check that a code value is not null when the previous one is null. You can do this by shifting the 'code' column by one position and comparing, via a list comprehension, the shifted values with the originals.
A cumulative sum then creates unique values for grouping.

# True at each group start: current code not null, previous one null
df['uniquecode'] = [pd.notnull(curr) and pd.isnull(prev)
                    for curr, prev in zip(df['code'], df['code'].shift(1))]
df['uniquecode'] = df['uniquecode'].cumsum()
ddf = df.groupby('uniquecode').agg({'code': 'mean', 'item01': 'sum', 'item02': 'sum',
                                    'item03': 'sum', 'item04': 'sum', 'item05': 'sum'})
ddf['item01'] = ddf['item01'].apply(lambda x: "'" + x.replace("'", "") + "'")

This returns ddf:

              code   item01  item02  item03  item04  item05
uniquecode                                                 
1           1111.0   'abcd'   123.0   234.0   345.0   440.0
2           2222.0  'bbcde'   123.0   234.0   345.0   456.0
3           3333.0     'db'   123.0   234.0   345.0   456.0
4           1111.0  'cbcde'   123.0   234.0   345.0   456.0
5           5555.0      'a'     0.0     0.0     0.0     0.0

The last line uses apply to strip the unneeded ' characters, since all your strings are wrapped in single quotes, and re-add one pair around the result.
You can get rid of the 'uniquecode' index by doing ddf.reset_index(drop=True, inplace=True)

6 Comments

With your sample data it works. What do you mean by "it's not unique"?
Apologies for my phrasing; I just edited the example. What I mean is that the same code can come up again in the file, but I don't want to combine those occurrences.
I see. Does each "group" at least start with 'a' in the 'item01' column?
No, there is no unique 'key' in this data set; I will change the example. One thing that is certain is that the row before every starting row has a NaN value in the 'code' column.
So how do you identify the start of each group? Just when a row is not null in item02, item03, etc.?

Can you check whether this code works for you? (I edited the code.)

df1 = df.ffill()
df1['prev_code'] = df1['code'].shift(1)
# Mark the row index wherever the code changes; NaN elsewhere, then forward-fill
df1['grkey'] = df1.reset_index().apply(
    lambda x: x['index'] if x.code != x.prev_code else float('nan'), axis=1)
df1 = (df1.ffill()
          .groupby('grkey')
          .agg({'code': 'first', 'item01': 'sum', 'item02': 'first',
                'item03': 'first', 'item04': 'first', 'item05': 'first'})
          .reset_index()
          .drop('grkey', axis=1))
# Summing "'a'" + "'b'" gives "'a''b'"; collapsing the doubled quotes yields "'ab'"
df1['item01'] = df1['item01'].apply(lambda x: x.replace("''", ""))

2 Comments

As I mentioned above, the code is not unique, so this approach doesn't work. But unutbu's answer works! Anyhow, thanks for helping!
Sorry clide, I didn't notice that. I have edited the code to meet that condition; you can test and use whichever you like based on performance.
