
I have a very unstructured data frame, listed below. The goal is to combine it into a 5-row dataframe by concatenating the strings in item01 for lines 0-3, 4-8, 9-10, 11-15, and 16 (the code is the same within each line set, but codes are not unique across sets). I was able to get the starting indices (0, 4, 9, 11, 16; the row before each starting row has NaN in the 'code' column) without using a for loop, but I can't think of a way to combine these lines without one. Can someone help? Thank you!

     code    item01  item02  item03  item04  item05
    0   1111    'a' 123 234 345 440
    1   1111    'b' nan nan nan nan
    2   nan     'c' nan nan nan nan
    3   nan     'd' nan nan nan nan
    4   2222    'b' 123 234 345 456
    5   2222    'b' nan nan nan nan
    6   nan     'c' nan nan nan nan
    7   nan     'd' nan nan nan nan
    8   nan     'e' nan nan nan nan
    9   3333    'd' 123 234 345 456
    10  nan     'b' nan nan nan nan
    11  1111    'c' 123 234 345 456
    12  1111    'b' nan nan nan nan
    13  nan     'c' nan nan nan nan
    14  nan     'd' nan nan nan nan
    15  nan     'e' nan nan nan nan
    16  5555    'a' nan nan nan nan

Expected Results:

     code    item01  item02  item03  item04  item05
    0   1111    'abcd'  123 234 345 440
    1   2222    'bbcde' 123 234 345 456
    2   3333    'db'    123 234 345 456
    3   1111    'cbcde' 123 234 345 456
    4   5555    'a'     123 234 345 456
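(For reference, the loop-free start-index computation mentioned above can be sketched like this, assuming the 'code' column shown:)

```python
import numpy as np
import pandas as pd
nan = np.nan

codes = [1111, 1111, nan, nan, 2222, 2222, nan, nan, nan,
         3333, nan, 1111, 1111, nan, nan, nan, 5555]
df = pd.DataFrame({'code': codes})

# A row starts a new group when its 'code' is not NaN
# but the previous row's 'code' is NaN (or there is no previous row)
starts = df.index[df['code'].notna() & df['code'].shift(1).isna()]
print(list(starts))  # [0, 4, 9, 11, 16]
```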
  • Show the expected result. Commented Jul 19, 2019 at 16:37
  • "combine" is too vague. What do you want to do? Sum? Mean value? There are a lot of possibilities. Commented Jul 19, 2019 at 16:38
  • Just edit the question. Thank you! Commented Jul 19, 2019 at 16:45
  • groupby with agg Commented Jul 19, 2019 at 16:45
  • I don't think groupby works because the code is not unique. Commented Jul 19, 2019 at 16:47

3 Answers


If you define

code_notnull = pd.notnull(df['code'])    

Then you can identify the start of each new group using

# True when the row is not null, but the prior row is null
mask = code_notnull & ~(code_notnull.shift(1, fill_value=False))
0      True
1     False
2     False
3     False
4      True
...

You can then define group numbers using

group_num = mask.cumsum()
0     1
1     1
2     1
3     1
4     2
...

and then group by group_num:

import numpy as np
import pandas as pd
nan = np.nan

df = pd.DataFrame({'code': [1111.0, 1111.0, nan, nan, 2222.0, 2222.0, nan, nan, nan, 3333.0, nan,
    1111.0, 1111.0, nan, nan, nan, 5555.0], 'item01': ['a', 'b', 'c', 'd',
    'b', 'b', 'c', 'd', 'e', 'd', 'b', 'c', 'b', 'c', 'd',
    'e', 'a'], 'item02': [123.0, nan, nan, nan, 123.0, nan, nan, nan, nan,
    123.0, nan, 123.0, nan, nan, nan, nan, nan], 'item03': [234.0, nan, nan, nan,
    234.0, nan, nan, nan, nan, 234.0, nan, 234.0, nan, nan, nan, nan, nan],
    'item04': [345.0, nan, nan, nan, 345.0, nan, nan, nan, nan, 345.0, nan, 345.0,
    nan, nan, nan, nan, nan], 'item05': [440.0, nan, nan, nan, 456.0, nan, nan,
    nan, nan, 456.0, nan, 456.0, nan, nan, nan, nan, nan]})

code_notnull = pd.notnull(df['code'])
mask = code_notnull & ~(code_notnull.shift(1, fill_value=False))
group_num = mask.cumsum()

# Forward-fill all NaNs. 
df = df.ffill()
grouped = df.groupby(group_num)
result = grouped.first()
result['item01'] = grouped['item01'].sum()
print(result)

yields

        code item01  item02  item03  item04  item05
code                                               
1     1111.0   abcd   123.0   234.0   345.0   440.0
2     2222.0  bbcde   123.0   234.0   345.0   456.0
3     3333.0     db   123.0   234.0   345.0   456.0
4     1111.0  cbcde   123.0   234.0   345.0   456.0
5     5555.0      a   123.0   234.0   345.0   456.0

Note that above I assumed your strings in item01 do not begin and end with single quotation marks. If they do, you could remove them with

df['item01'] = df['item01'].str[1:-1]

and then proceed as above.

import numpy as np
import pandas as pd
nan = np.nan

df = pd.DataFrame({'code': [1111.0, 1111.0, nan, nan, 2222.0, 2222.0, nan, nan, nan, 3333.0, nan,
    1111.0, 1111.0, nan, nan, nan, 5555.0], 'item01': ["'a'", "'b'", "'c'", "'d'",
    "'b'", "'b'", "'c'", "'d'", "'e'", "'d'", "'b'", "'c'", "'b'", "'c'", "'d'",
    "'e'", "'a'"], 'item02': [123.0, nan, nan, nan, 123.0, nan, nan, nan, nan,
    123.0, nan, 123.0, nan, nan, nan, nan, nan], 'item03': [234.0, nan, nan, nan,
    234.0, nan, nan, nan, nan, 234.0, nan, 234.0, nan, nan, nan, nan, nan],
    'item04': [345.0, nan, nan, nan, 345.0, nan, nan, nan, nan, 345.0, nan, 345.0,
    nan, nan, nan, nan, nan], 'item05': [440.0, nan, nan, nan, 456.0, nan, nan,
    nan, nan, 456.0, nan, 456.0, nan, nan, nan, nan, nan]})
df['item01'] = df['item01'].str[1:-1]
print(df)

yields (single quotes in df['item01'] have been removed)

      code item01  item02  item03  item04  item05
0   1111.0      a   123.0   234.0   345.0   440.0
1   1111.0      b     NaN     NaN     NaN     NaN
2      NaN      c     NaN     NaN     NaN     NaN
3      NaN      d     NaN     NaN     NaN     NaN
...

If you want to add single quotes back to the final result, you could use:

result['item01'] = "'" + result['item01'] + "'"
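As a side note, the mask-plus-cumsum steps above can be condensed into a single agg call; a minimal sketch on a trimmed-down frame (first two item columns only, strings assumed unquoted):

```python
import numpy as np
import pandas as pd
nan = np.nan

df = pd.DataFrame({'code': [1111.0, 1111.0, nan, nan, 2222.0],
                   'item01': ['a', 'b', 'c', 'd', 'b'],
                   'item02': [123.0, nan, nan, nan, 123.0]})

code_notnull = df['code'].notna()
group_num = (code_notnull & ~code_notnull.shift(1, fill_value=False)).cumsum()

# ''.join concatenates each group's strings; 'first' keeps the top row's values
result = (df.ffill()
            .groupby(group_num)
            .agg({'code': 'first', 'item01': ''.join, 'item02': 'first'})
            .reset_index(drop=True))
print(result)
```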



You can do it with groupby after you create a valid grouping column with a unique code per group.

If all the rows of each group are contiguous and the logic to identify a new group is:

The previous row of a starting row has column 'code' with value 'nan'

you simply need to check that a code value is not null when the previous one is null. You can do this by shifting the 'code' column by one position and comparing, via a list comprehension, the shifted values with the originals.
A cumulative sum then creates unique values for grouping.

# True at each group start: current code not null, previous one null
df['uniquecode'] = [pd.notnull(curr) and pd.isnull(prev)
                    for curr, prev in zip(df['code'], df['code'].shift(1))]
df['uniquecode'] = df['uniquecode'].cumsum()
ddf = df.groupby('uniquecode').agg({'code': 'mean', 'item01': 'sum', 'item02': 'sum',
                                    'item03': 'sum', 'item04': 'sum', 'item05': 'sum'})
ddf['item01'] = ddf['item01'].apply(lambda x: "'" + x.replace("'", "") + "'")

This returns ddf:

              code   item01  item02  item03  item04  item05
uniquecode                                                 
1           1111.0   'abcd'   123.0   234.0   345.0   440.0
2           2222.0  'bbcde'   123.0   234.0   345.0   456.0
3           3333.0     'db'   123.0   234.0   345.0   456.0
4           1111.0  'cbcde'   123.0   234.0   345.0   456.0
5           5555.0      'a'     0.0     0.0     0.0     0.0

The last line uses apply to strip the unneeded ' characters, since all your strings are wrapped in single quotes, and re-add one pair around the result.
You can get rid of the 'uniquecode' index by doing ddf.reset_index(drop=True, inplace=True)

6 Comments

With your sample data it works. What do you mean by "it's not unique"?
Apologies for my phrasing; I just edited the example. What I mean is that the same code can come up again in the file, but I don't want to combine those occurrences.
I see. Does each "group" at least start with 'a' in the 'item01' column?
No, there is no unique 'key' in this data set; I will change the example. One thing that is certain is that the row before every starting row has a NaN value in the 'code' column.
So how do you identify the start of each group? Just when a row is not null in item02, item03, etc.?

Can you check whether this code works for you? (I edited the code.)

df1 = df.ffill()
df1['prev_code'] = df1['code'].shift(1)
# Mark the row index wherever the code changes; NaN elsewhere, then forward-fill
df1['grkey'] = df1.reset_index().apply(
    lambda x: x['index'] if x.code != x.prev_code else float('nan'), axis=1)
df1 = (df1.ffill()
          .groupby('grkey')
          .agg({'code': 'first', 'item01': 'sum', 'item02': 'first',
                'item03': 'first', 'item04': 'first', 'item05': 'first'})
          .reset_index()
          .drop('grkey', axis=1))
# Summing "'a'" + "'b'" gives "'a''b'"; collapsing the doubled quotes yields "'ab'"
df1['item01'] = df1['item01'].apply(lambda x: x.replace("''", ""))

2 Comments

As I mentioned above, the code is not unique, so this approach doesn't work. But unutbu's answer works! Anyhow, thanks for helping!
Sorry clide, I didn't notice that. I have edited the code to meet that condition; you can test and use whichever you like based on performance.
