Comparing dataframe rows that contain strings

Question

Consider this dataframe:

id     name           date_time                 strings   
1      'AAA'    2018-08-03 18:00:00             1125,1517,656,657
1      'AAA'    2018-08-03 18:45:00             128,131,646,535,157,159
1      'AAA'    2018-08-03 18:49:00             131
1      'BBB'    2018-08-03 19:41:00             0
1      'BBB'    2018-08-05 19:30:00             0
1      'AAA'    2018-08-04 11:00:00             131
1      'AAA'    2018-08-04 11:30:00             1000
1      'AAA'    2018-08-04 11:33:00             1000,5555

First, within each group of rows that share id and name, I want to check whether each row has at least one string in common with the previous row; if so, match is True. (Some rows of the strings column had no value, so they have been filled with 0.) The desired output:

id     name           date_time                 strings                    match       
1      'AAA'    2018-08-03 18:00:00             1125,128,1517,656,657       False       
1      'AAA'    2018-08-03 18:45:00             128,131,646,535,157,159     True       
1      'AAA'    2018-08-03 18:49:00             131                         True
1      'BBB'    2018-08-03 19:41:00             0                           False
1      'BBB'    2018-08-05 19:30:00             0                           False
1      'AAA'    2018-08-04 11:00:00             131                         True
1      'AAA'    2018-08-04 11:30:00             1000                        False
1      'AAA'    2018-08-04 11:33:00             1000,5555                   True

Then I want to group the rows by id and name and find the time difference between consecutive rows whose match value is True. If the time difference is less than 00:05:00, the flag is 1. The final output:

id     name           date_time                 strings                    diff        flag      
1      'AAA'    2018-08-03 18:00:00             1125,128,1517,656,657       00:00:00    0  
1      'AAA'    2018-08-03 18:45:00             128,131,646,535,157,159     00:00:00    0      
1      'AAA'    2018-08-03 18:49:00             131                         00:04:00    1
1      'BBB'    2018-08-03 19:41:00             0                           00:00:00    0
1      'BBB'    2018-08-05 19:30:00             0                           00:00:00    0
1      'AAA'    2018-08-04 11:00:00             131                         16:15:00    0
1      'AAA'    2018-08-04 11:30:00             1000                        00:00:00    0
1      'AAA'    2018-08-04 11:33:00             1000,5555                   00:33:00    0

For the first part I've tried this code but it doesn't work correctly:

grouped = df.groupby(['id','name'])
z = []
for index,row in grouped:
    z.append(list(zip(row['strings'], row['strings'].shift())))
df['match'] = [bool(set(str(s1).split(','))& set(str(s2).split(','))) for i in range(len(z)) for s1,s2 in z[i]]
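A note on the attempt above: iterating over a groupby yields the groups in key order, so the flattened list is ordered group by group rather than in the original row order, and assigning it to df['match'] places the values by position. The "0" placeholder also matches itself. The following is only a sketch of how the same set-intersection idea could be kept aligned with the original index; the small frame and the helper name `tokens` are made up for illustration.

```python
import pandas as pd

# Illustrative data only (not the full frame from the question).
df = pd.DataFrame({
    "id": [1, 1, 1, 1, 1],
    "name": ["AAA", "AAA", "BBB", "BBB", "AAA"],
    "strings": ["128,656", "128,131", "0", "0", "131"],
})

def tokens(cell):
    """Hypothetical helper: split a cell into a set of values, ignoring NaN and the '0' filler."""
    if pd.isna(cell):
        return set()
    return set(str(cell).split(",")) - {"0"}

# Previous row's strings within each (id, name) group; the result keeps df's index,
# so it lines up row by row even when groups are interleaved.
prev = df.groupby(["id", "name"])["strings"].shift()

# True when the current and previous rows of the same group share at least one value.
df["match"] = [bool(tokens(cur) & tokens(old)) for cur, old in zip(df["strings"], prev)]
print(df)
```

Because `shift` is applied per group and returns a Series on the original index, no manual re-ordering is needed.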

For the second part I've tried different solutions, but none of them works.

Any hints are appreciated.

Should the sixth row of match be True?

– ansev, Commented Dec 3, 2019 at 19:07

ansev · Accepted Answer · 2019-12-04 09:44:17Z

2

If you want to compare each row's strings with those of the previous row, use:

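import pandas as pd

# assumes df['date_time'] is already datetime64 (convert with pd.to_datetime if not)
# one 0/1 indicator column per distinct value in the comma-separated strings
dummies=df.strings.str.get_dummies(',')
# '0' is the placeholder for "no value" and must never count as a match
c1=df['strings'].ne('0')
# True where the row and the previous row of the same (id, name) group share a value
c2=( dummies.groupby([df['id'],df['name']]).shift().eq(dummies) & dummies.ge(1) ).any(axis=1)
df['match']=c1&c2
# time since the previous matching row of the same group; 0 for non-matching rows
df['diff']=( df.groupby(['id','name','match'])['date_time']
               .diff()
               .where(df['match'])
               .fillna(pd.Timedelta(hours=0)) )
print(df)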

   id   name           date_time                  strings  match     diff
0   1  'AAA' 2018-08-03 18:00:00    1125,128,1517,656,657  False 00:00:00
1   1  'AAA' 2018-08-03 18:45:00  128,131,646,535,157,159   True 00:00:00
2   1  'AAA' 2018-08-03 18:49:00                      131   True 00:04:00
3   1  'BBB' 2018-08-03 19:41:00                        0  False 00:00:00
4   1  'BBB' 2018-08-05 19:30:00                        0  False 00:00:00
5   1  'AAA' 2018-08-04 11:00:00                      131   True 16:11:00
6   1  'AAA' 2018-08-04 11:30:00                     1000  False 00:00:00
7   1  'AAA' 2018-08-04 11:33:00                1000,5555   True 00:33:00
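To see why the `get_dummies` comparison works, it may help to look at it on a tiny Series. This is only an illustration with made-up values and without the grouping step:

```python
import pandas as pd

# Made-up values for illustration.
s = pd.Series(["128,131", "131", "0", "1000,5555"])

# One indicator column per distinct value; 1 where the row contains that value.
dummies = s.str.get_dummies(",")
print(dummies)

# A value is shared with the previous row when the indicator is equal in both rows
# and is at least 1 in the current row (equal zeros do not count).
shared = (dummies.shift().eq(dummies) & dummies.ge(1)).any(axis=1)
print(shared.tolist())
```

The first row has no predecessor, so the shifted frame holds NaN there and the comparison is False. Note that "0" gets an indicator column like any other value, which is why the answer masks it separately with `c1`.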

If you want to compare each row with both of its neighbours (the previous and the next row of the same group):

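dummies=df.strings.str.get_dummies(',')
c1=df['strings'].ne('0') # or  df['strings'].ne(0) if the column holds integers
# sum each indicator over a centred 3-row window (previous, current, next) per group;
# a sum above 1 means the value occurs in more than one row of the window
c2=( (dummies.groupby([df['id'],df['name']],as_index=False)
             .rolling(3,center=True,min_periods=1)
             .sum()
             .gt(1) ).any(axis=1)
                     # drop the group level so the index lines up with df again;
                     # depending on the pandas version, the grouped rolling result may
                     # carry more than one extra index level that has to be dropped
                     .reset_index(level=0,drop=True) )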
df['match']=c1&c2
df['diff']=( df.groupby(['id','name','match'])['date_time']
               .diff()
               .where(df['match'])
               .fillna(pd.Timedelta(hours=0)) )
print(df)

Output

   id   name           date_time                  strings  match     diff
0   1  'AAA' 2018-08-03 18:00:00        1125,1517,656,657  False 00:00:00
1   1  'AAA' 2018-08-03 18:45:00  128,131,646,535,157,159   True 00:00:00
2   1  'AAA' 2018-08-03 18:49:00                      131   True 00:04:00
3   1  'BBB' 2018-08-03 19:41:00                        0  False 00:00:00
4   1  'BBB' 2018-08-05 19:30:00                        0  False 00:00:00
5   1  'AAA' 2018-08-04 11:00:00                      131   True 16:11:00
6   1  'AAA' 2018-08-04 11:30:00                     1000   True 00:30:00
7   1  'AAA' 2018-08-04 11:33:00                1000,5555   True 00:03:00
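Neither variant computes the flag column the question asks for. A minimal sketch, assuming the match and diff columns of the first variant (comparison with the previous row) and the first-row strings as shown in the question's desired output; the names `zero` and `limit` are introduced here for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1] * 8,
    "name": ["AAA", "AAA", "AAA", "BBB", "BBB", "AAA", "AAA", "AAA"],
    "date_time": pd.to_datetime([
        "2018-08-03 18:00:00", "2018-08-03 18:45:00", "2018-08-03 18:49:00",
        "2018-08-03 19:41:00", "2018-08-05 19:30:00", "2018-08-04 11:00:00",
        "2018-08-04 11:30:00", "2018-08-04 11:33:00",
    ]),
    "strings": ["1125,128,1517,656,657", "128,131,646,535,157,159", "131",
                "0", "0", "131", "1000", "1000,5555"],
})

# match and diff, computed as in the first variant of the answer
dummies = df["strings"].str.get_dummies(",")
c1 = df["strings"].ne("0")
c2 = (dummies.groupby([df["id"], df["name"]]).shift().eq(dummies)
      & dummies.ge(1)).any(axis=1)
df["match"] = c1 & c2
df["diff"] = (df.groupby(["id", "name", "match"])["date_time"]
                .diff()
                .where(df["match"])
                .fillna(pd.Timedelta(0)))

# flag = 1 when a real (non-zero) gap between matching rows is under five minutes.
# Caveat: a zero diff also stands for "no previous matching row", so two matching
# rows with identical timestamps would not be flagged by this sketch.
zero = pd.Timedelta(0)
limit = pd.Timedelta(minutes=5)
df["flag"] = (df["diff"].gt(zero) & df["diff"].lt(limit)).astype(int)
print(df[["name", "date_time", "match", "diff", "flag"]])
```

On this data only the 18:49 row is flagged, which agrees with the final output in the question. With the neighbour-based variant the last row (diff 00:03:00) would be flagged as well.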

edited Dec 4, 2019 at 9:44

answered Dec 3, 2019 at 19:04


3 Comments

Saraaa Over a year ago

Thank you. The sixth row of match should be False, as 1000 does not match 131.

ansev Over a year ago

But it matches the last row.

ansev Over a year ago

I have updated my solution; please consider accepting or upvoting.
