
Consider this dataframe:

id     name           date_time                 strings   
1      'AAA'    2018-08-03 18:00:00             1125,1517,656,657
1      'AAA'    2018-08-03 18:45:00             128,131,646,535,157,159
1      'AAA'    2018-08-03 18:49:00             131
1      'BBB'    2018-08-03 19:41:00             0
1      'BBB'    2018-08-05 19:30:00             0
1      'AAA'    2018-08-04 11:00:00             131
1      'AAA'    2018-08-04 11:30:00             1000
1      'AAA'    2018-08-04 11:33:00             1000,5555

First, within each group of rows that share id and name, I want to check whether each row has a string in common with the preceding row; if so, match is True. (Some rows in the strings column have no value, so they have been filled with 0.) The desired output:

id     name           date_time                 strings                    match       
1      'AAA'    2018-08-03 18:00:00             1125,128,1517,656,657       False       
1      'AAA'    2018-08-03 18:45:00             128,131,646,535,157,159     True       
1      'AAA'    2018-08-03 18:49:00             131                         True
1      'BBB'    2018-08-03 19:41:00             0                           False
1      'BBB'    2018-08-05 19:30:00             0                           False
1      'AAA'    2018-08-04 11:00:00             131                         True
1      'AAA'    2018-08-04 11:30:00             1000                        False
1      'AAA'    2018-08-04 11:33:00             1000,5555                   True

Then I want to group the rows by id and name and find the time difference between consecutive rows whose match value is True. If the time difference is less than 00:05:00, the flag is 1. The final output:

id     name           date_time                 strings                    diff        flag      
1      'AAA'    2018-08-03 18:00:00             1125,128,1517,656,657       00:00:00    0  
1      'AAA'    2018-08-03 18:45:00             128,131,646,535,157,159     00:00:00    0      
1      'AAA'    2018-08-03 18:49:00             131                         00:04:00    1
1      'BBB'    2018-08-03 19:41:00             0                           00:00:00    0
1      'BBB'    2018-08-05 19:30:00             0                           00:00:00    0
1      'AAA'    2018-08-04 11:00:00             131                         16:11:00    0
1      'AAA'    2018-08-04 11:30:00             1000                        00:00:00    0
1      'AAA'    2018-08-04 11:33:00             1000,5555                   00:33:00    0

For the first part I've tried this code, but it doesn't work correctly:

grouped = df.groupby(['id','name'])
z = []
for index,row in grouped:
    z.append(list(zip(row['strings'], row['strings'].shift())))
df['match'] = [bool(set(str(s1).split(','))& set(str(s2).split(','))) for i in range(len(z)) for s1,s2 in z[i]]

For the second part I've tried several approaches, but none of them works.

Any hints are appreciated.

  • Should the sixth row of match be True? Commented Dec 3, 2019 at 19:07

1 Answer


If you want to compare each row with the previous one, use:

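import pandas as pd

# One indicator column per distinct string (1 if the row contains it)
dummies=df.strings.str.get_dummies(',')
# Rows filled with '0' have no real strings, so they can never match
c1=df['strings'].ne('0')
# True where a string in this row was also in the previous row of the same (id, name) group
c2=( dummies.groupby([df['id'],df['name']]).shift().eq(dummies) & dummies.ge(1) ).any(axis=1)
df['match']=c1&c2
# Time since the previous matching row of the same group (date_time must be a datetime column);
# non-matching rows and the first match of a group get 0
df['diff']=( df.groupby(['id','name','match'])['date_time']
               .diff()
               .where(df['match'])
               .fillna(pd.Timedelta(hours=0)) )
print(df)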

   id   name           date_time                  strings  match     diff
0   1  'AAA' 2018-08-03 18:00:00    1125,128,1517,656,657  False 00:00:00
1   1  'AAA' 2018-08-03 18:45:00  128,131,646,535,157,159   True 00:00:00
2   1  'AAA' 2018-08-03 18:49:00                      131   True 00:04:00
3   1  'BBB' 2018-08-03 19:41:00                        0  False 00:00:00
4   1  'BBB' 2018-08-05 19:30:00                        0  False 00:00:00
5   1  'AAA' 2018-08-04 11:00:00                      131   True 16:11:00
6   1  'AAA' 2018-08-04 11:30:00                     1000  False 00:00:00
7   1  'AAA' 2018-08-04 11:33:00                1000,5555   True 00:33:00
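
The question also asks for a flag column, which the code above does not produce. The following is a minimal sketch of one way to add it, assuming the sample data from the question (with 128 in the first row, as in the desired output) and the previous-row definition of match. The name raw_diff is introduced here for illustration only. Keeping the unfilled difference around lets the first matching row of a group (which has no earlier match) stay unflagged, because comparisons against NaT are False.

```python
import pandas as pd

# Sample data from the question (first row includes 128, as in the desired output)
df = pd.DataFrame({
    "id": [1] * 8,
    "name": ["AAA", "AAA", "AAA", "BBB", "BBB", "AAA", "AAA", "AAA"],
    "date_time": pd.to_datetime([
        "2018-08-03 18:00:00", "2018-08-03 18:45:00", "2018-08-03 18:49:00",
        "2018-08-03 19:41:00", "2018-08-05 19:30:00", "2018-08-04 11:00:00",
        "2018-08-04 11:30:00", "2018-08-04 11:33:00",
    ]),
    "strings": ["1125,128,1517,656,657", "128,131,646,535,157,159", "131",
                "0", "0", "131", "1000", "1000,5555"],
})

# match: the row shares at least one real string with the previous row of its group
dummies = df["strings"].str.get_dummies(",")
shares_prev = (dummies.groupby([df["id"], df["name"]]).shift().eq(dummies)
               & dummies.ge(1)).any(axis=1)
df["match"] = df["strings"].ne("0") & shares_prev

# raw_diff (hypothetical helper name): time since the previous matching row,
# NaT for non-matching rows and for the first match in a group
raw_diff = (df.groupby(["id", "name", "match"])["date_time"]
              .diff()
              .where(df["match"]))
df["diff"] = raw_diff.fillna(pd.Timedelta(0))

# flag: 1 only when a real difference exists and is under five minutes
df["flag"] = raw_diff.lt(pd.Timedelta(minutes=5)).astype(int)
print(df)
```

With this data only the 18:49:00 row is flagged, since it follows the previous matching row by four minutes.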

If you want to compare each row with its adjacent rows (the previous and the next one):

dummies=df.strings.str.get_dummies(',')
c1=df['strings'].ne('0') # or  df['strings'].ne(0)
c2=( (dummies.groupby([df['id'],df['name']],as_index=False)
             .rolling(3,center=True,min_periods=1)
             .sum()
             .gt(1) ).any(axis=1)
                     .reset_index(level=0,drop='level_0') )
df['match']=c1&c2
df['diff']=( df.groupby(['id','name','match'])['date_time']
               .diff()
               .where(df['match'])
               .fillna(pd.Timedelta(hours=0)) )
print(df)

Output

   id   name           date_time                  strings  match     diff
0   1  'AAA' 2018-08-03 18:00:00        1125,1517,656,657  False 00:00:00
1   1  'AAA' 2018-08-03 18:45:00  128,131,646,535,157,159   True 00:00:00
2   1  'AAA' 2018-08-03 18:49:00                      131   True 00:04:00
3   1  'BBB' 2018-08-03 19:41:00                        0  False 00:00:00
4   1  'BBB' 2018-08-05 19:30:00                        0  False 00:00:00
5   1  'AAA' 2018-08-04 11:00:00                      131   True 16:11:00
6   1  'AAA' 2018-08-04 11:30:00                     1000   True 00:30:00
7   1  'AAA' 2018-08-04 11:33:00                1000,5555   True 00:03:00
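
The rolling-sum check above counts how often a string occurs in a three-row window, so it would also mark a row as matching when its previous and next rows share a string that the row itself lacks. The index layout returned by a grouped rolling operation may also depend on the pandas version. As a hedged alternative (not the method used above), the same adjacent-row comparison can be sketched with two grouped shifts; the names has, shares_prev and shares_next are introduced here for illustration, and the data is the sample from the question.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "id": [1] * 8,
    "name": ["AAA", "AAA", "AAA", "BBB", "BBB", "AAA", "AAA", "AAA"],
    "date_time": pd.to_datetime([
        "2018-08-03 18:00:00", "2018-08-03 18:45:00", "2018-08-03 18:49:00",
        "2018-08-03 19:41:00", "2018-08-05 19:30:00", "2018-08-04 11:00:00",
        "2018-08-04 11:30:00", "2018-08-04 11:33:00",
    ]),
    "strings": ["1125,1517,656,657", "128,131,646,535,157,159", "131",
                "0", "0", "131", "1000", "1000,5555"],
})

dummies = df["strings"].str.get_dummies(",")
grouped = dummies.groupby([df["id"], df["name"]])

# has: which strings the current row itself contains
has = dummies.eq(1)
# A string counts only if it is in the current row AND in the neighbouring row
shares_prev = (grouped.shift(1).eq(1) & has).any(axis=1)
shares_next = (grouped.shift(-1).eq(1) & has).any(axis=1)

# Rows filled with '0' carry no real strings and never match
df["match"] = df["strings"].ne("0") & (shares_prev | shares_next)

df["diff"] = (df.groupby(["id", "name", "match"])["date_time"]
                .diff()
                .where(df["match"])
                .fillna(pd.Timedelta(0)))
print(df)
```

On the sample data this gives the same match and diff columns as the output shown above.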

3 Comments

Thank you. The sixth row of match should be False, as 1000 does not match 131.
But it matches the last row.
I have updated my solution; please consider accepting or upvoting.
