Count occurrences of a string in multiple string columns

Question

I have a dataframe called df that looks similar to this (except the number of 'mat_deliv' columns goes up to mat_deliv_8, there are several hundred clients and a number of other columns between Client_ID and mat_deliv_1 - I have simplified it here).

Client_ID  mat_deliv_1  mat_deliv_2  mat_deliv_3  mat_deliv_4
C1019876   xxx,yyy,zzz  aaa,bbb,xxx  xxx          ddd
C1018765   yyy,zzz      xxx          xxx          None
C1017654   yyy,xxx      aaa,bbb      ccc          ddd
C1016543   aaa,bbb      ccc          None         None
C1019876   yyy          None         None         None

I want to create a new column called xxx_count which counts the number of times xxx appears in mat_deliv_1, mat_deliv_2, mat_deliv_3 and mat_deliv_4. The values should look like this:

Client_ID  mat_deliv_1  mat_deliv_2  mat_deliv_3  mat_deliv_4  xxx_count
C1019876   xxx,yyy,zzz  aaa,bbb,xxx  xxx          ddd          3
C1018765   yyy,zzz      xxx          xxx          None         2
C1017654   yyy,xxx      aaa,bbb      ccc          ddd          1
C1016543   aaa,bbb      ccc          None         None         0
C1015432   yyy          None         None         None         0

I have tried the following code:

df = df.assign(xxx_count=df.loc[:, "mat_deliv_1":"mat_deliv_4"].\
               apply(lambda col: col.str.count('xxx')).fillna(0).astype(int))

But it does not produce a count, only a binary variable where 0 = no cases of xxx and 1 = the presence of xxx in at least one of the four mat_deliv columns.

NB: this is a follow-up question to that asked here: Creating a column based on the presence of part of a string in multiple other columns

cs95 · Accepted Answer · 2019-01-08 20:08:11Z

3

Try joining them horizontally before counting?

df['counts'] = (df.loc[:, "mat_deliv_1":"mat_deliv_4"]
                  .fillna('')
                  .agg(','.join, 1)
                  .str.count('xxx'))
df
  Client_ID  mat_deliv_1  mat_deliv_2 mat_deliv_3 mat_deliv_4  counts
0  C1019876  xxx,yyy,zzz  aaa,bbb,xxx         xxx         ddd       3
1  C1018765      yyy,zzz          xxx         xxx         NaN       2
2  C1017654      yyy,xxx      aaa,bbb         ccc         ddd       1
3  C1016543      aaa,bbb          ccc         NaN         NaN       0
4  C1019876          yyy          NaN         NaN         NaN       0

This will work assuming "xxx" occurs at most once per column. If it occurs more than once in a column, each occurrence is counted.
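To make that caveat concrete, here is a minimal sketch on a small made-up frame (the `demo` data is hypothetical, not the asker's). It shows that the joined string is scanned for every non-overlapping match, so a repeated value in one cell adds to the total, and a longer value such as "xxxx" also matches once.

```python
import pandas as pd

# Hypothetical data chosen to expose the edge cases.
demo = pd.DataFrame({
    "mat_deliv_1": ["xxx,xxx", "xxxx", "yyy"],
    "mat_deliv_2": ["xxx", None, None],
})

# Replace missing cells with "" so every cell can be joined, then glue each row together.
joined = demo.fillna("").agg(",".join, axis=1)

# str.count returns the number of non-overlapping matches per row.
counts = joined.str.count("xxx")
print(counts.tolist())
```

The first row yields 3 because both copies in mat_deliv_1 are counted, and the second row yields 1 because "xxxx" contains "xxx".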

Another option involves stack:

# groupby(level=0).sum() replaces sum(level=0), which pandas 2.0 removed
df['counts'] = (df.loc[:, "mat_deliv_1":"mat_deliv_4"].stack().str.count('xxx')
                  .groupby(level=0).sum())
df
  Client_ID  mat_deliv_1  mat_deliv_2 mat_deliv_3 mat_deliv_4  counts
0  C1019876  xxx,yyy,zzz  aaa,bbb,xxx         xxx         ddd       3
1  C1018765      yyy,zzz          xxx         xxx         NaN       2
2  C1017654      yyy,xxx      aaa,bbb         ccc         ddd       1
3  C1016543      aaa,bbb          ccc         NaN         NaN       0
4  C1019876          yyy          NaN         NaN         NaN       0

This can easily be modified to count just the first occurrence, using str.contains:

df['counts'] = (df.loc[:, "mat_deliv_1":"mat_deliv_4"].stack().str.contains('xxx')
                  .groupby(level=0).sum())
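One thing to watch with the stack approach: depending on the pandas version, stack may drop missing cells, so a row whose mat_deliv columns are all missing can vanish from the result and end up as NaN after assignment. A minimal sketch of a guard, using hypothetical `demo` data and reindexing against the original index:

```python
import pandas as pd

# Hypothetical data: the middle row has no deliveries at all.
demo = pd.DataFrame({
    "mat_deliv_1": ["xxx,yyy", None, "zzz"],
    "mat_deliv_2": ["xxx", None, None],
})

# Per-row totals from the stacked (long) form.
per_row = demo.stack().str.count("xxx").groupby(level=0).sum()

# Restore any rows that stack dropped and make the result an integer column.
counts = per_row.reindex(demo.index, fill_value=0).astype(int)
print(counts.tolist())
```

The reindex is a no-op when no rows were dropped, so it is safe to leave in either way.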

If it is possible for "xxx" to be a substring, first split and then count:

df['counts'] = (df.loc[:, "mat_deliv_1":"mat_deliv_4"]
                  .stack()
                  .str.split(',', expand=True)
                  .eq('xxx')
                  .any(axis=1)  # change to `.sum(axis=1)` to count all occurrences
                  .groupby(level=0).sum())
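If you want every exact occurrence rather than one per column, a variation (not from the original answer) is to join each row, split on the comma, and explode into one token per line before comparing. The `demo` frame below is hypothetical and only illustrates the idea.

```python
import pandas as pd

# Hypothetical data with near-miss values ("xxxx", "axxx").
demo = pd.DataFrame({
    "mat_deliv_1": ["xxx,xxxx", "axxx", "xxx"],
    "mat_deliv_2": ["xxx", None, "yyy,xxx"],
})

tokens = (demo.fillna("")
              .agg(",".join, axis=1)   # one comma-separated string per row
              .str.split(",")          # list of tokens per row
              .explode())              # one token per line, original index repeated

# Exact equality, so "xxxx" and "axxx" are not counted.
counts = tokens.eq("xxx").groupby(level=0).sum()
print(counts.tolist())
```

Because the comparison is on whole tokens, substrings of longer values do not inflate the count.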

For performance, use a list comprehension:

df['counts'] = [
    ','.join(x).count('xxx') 
    for x in df.loc[:, "mat_deliv_1":"mat_deliv_4"].fillna('').values
]
df
  Client_ID  mat_deliv_1  mat_deliv_2 mat_deliv_3 mat_deliv_4  counts
0  C1019876  xxx,yyy,zzz  aaa,bbb,xxx         xxx         ddd       3
1  C1018765      yyy,zzz          xxx         xxx         NaN       2
2  C1017654      yyy,xxx      aaa,bbb         ccc         ddd       1
3  C1016543      aaa,bbb          ccc         NaN         NaN       0
4  C1019876          yyy          NaN         NaN         NaN       0

Why is a loop faster than using str methods or apply? See "For loops with pandas - When should I care?"
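If you want to check the speed claim on your own data, a rough sketch with timeit follows. The frame is the question's sample repeated many times, the function names are made up for this example, and the timings will vary by machine and pandas version, so treat the numbers as indicative only.

```python
import timeit
import pandas as pd

base = pd.DataFrame({
    "mat_deliv_1": ["xxx,yyy,zzz", "yyy,zzz", "yyy,xxx", "aaa,bbb", "yyy"],
    "mat_deliv_2": ["aaa,bbb,xxx", "xxx", "aaa,bbb", "ccc", None],
    "mat_deliv_3": ["xxx", "xxx", "ccc", None, None],
    "mat_deliv_4": ["ddd", None, "ddd", None, None],
})
big = pd.concat([base] * 500, ignore_index=True)  # 2,500 rows

def count_with_str_methods(frame):
    # Join each row, then count with the pandas string accessor.
    return frame.fillna("").agg(",".join, axis=1).str.count("xxx").tolist()

def count_with_list_comp(frame):
    # Same logic in plain Python over the underlying array.
    return [",".join(row).count("xxx") for row in frame.fillna("").to_numpy()]

# Both versions should agree before comparing speed.
assert count_with_str_methods(big) == count_with_list_comp(big)

for func in (count_with_str_methods, count_with_list_comp):
    seconds = timeit.timeit(lambda: func(big), number=5)
    print(f"{func.__name__}: {seconds:.3f}s for 5 runs")
```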

edited Jan 8, 2019 at 20:08

answered Jan 8, 2019 at 19:59

cs95


2 Comments

FGreen Over a year ago

This worked perfectly - I went for the final suggestion, using list comprehension. Thanks very much for your help

Alexander Over a year ago

A point to note with count is that it will include a matching substring within a larger string, e.g. xxx and xxxx will both count. If that is OK, then fine. If not, then one needs to test for equality, e.g.

[sum(1 for word in ','.join(row).split(',') if word == 'xxx') for row in df.loc[:, "mat_deliv_1":"mat_deliv_4"].fillna('').values]

BENY · Accepted Answer · 2019-01-08 20:04:14Z

2

Using str.findall

df.iloc[:,1:].apply(lambda x : x.str.findall('xxx')).sum(1).str.len()
Out[433]: 
0    3
1    2
2    1
3    0
4    0
dtype: int64
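Note that df.iloc[:, 1:] picks up every column after Client_ID, and the question mentions other columns sitting in between. If any of those are not strings, the .str accessor can raise, which may be what the comment on this answer ran into (that is a guess, not something confirmed here). A sketch that limits the work to the mat_deliv columns, with a hypothetical numeric `visits` column added to show the idea:

```python
import pandas as pd

df = pd.DataFrame({
    "Client_ID": ["C1019876", "C1018765", "C1016543"],
    "visits": [3, 1, 4],  # hypothetical non-string column
    "mat_deliv_1": ["xxx,yyy,zzz", "yyy,zzz", "aaa,bbb"],
    "mat_deliv_2": ["aaa,bbb,xxx", "xxx", None],
})

# Select only the columns whose name contains "mat_deliv".
deliv = df.filter(like="mat_deliv")

# findall gives a list of matches per cell; str.len turns that into a count.
# Missing cells become NaN, which sum skips.
counts = (deliv.apply(lambda col: col.str.findall("xxx").str.len())
               .sum(axis=1)
               .astype(int))
print(counts.tolist())
```

Counting per column with str.len also avoids summing lists across columns.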

answered Jan 8, 2019 at 20:04

BENY


1 Comment

FGreen Over a year ago

Thanks for the response - however, I received an error message after running it on my df that .str could not be run on an object dtype, so I opted for the answer below

jpp · Accepted Answer · 2019-01-08 20:55:09Z

0

You can split on "," and then use a lambda within a lambda. The advantage of this solution is that you don't get incorrect results when xxx occurs as a substring of a longer value (for example xxxx).

df['xxx_count'] = df.filter(like='mat_deliv').apply(lambda x: x.str.split(',')\
                                                    .apply(lambda x: 'xxx' in x)).sum(1)

print(df)

  Client_ID  mat_deliv_1  mat_deliv_2 mat_deliv_3 mat_deliv_4  xxx_count
0  C1019876  xxx,yyy,zzz  aaa,bbb,xxx         xxx         ddd          3
1  C1018765      yyy,zzz          xxx         xxx        None          2
2  C1017654      yyy,xxx      aaa,bbb         ccc         ddd          1
3  C1016543      aaa,bbb          ccc        None        None          0
4  C1019876          yyy         None        None        None          0

Or, better, use a function:

def sum_counts(series, value):
    def finder(item, value):
        return value in item
    return series.str.split(',').apply(finder, value=value)

df['xxx_count'] = df.filter(like='mat_deliv').apply(sum_counts, value='xxx').sum(1)
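The code above assumes every cell is a string (in the printed frame, None looks like it was read in as text). If your missing values are real None/NaN, `value in item` will raise on them. A minimal sketch of a guard, with a made-up helper name `has_token`, that counts how many mat_deliv cells in each row contain the exact token:

```python
import pandas as pd

def has_token(cell, token):
    # Missing cells (None/NaN) are not strings, so they never match.
    return isinstance(cell, str) and token in cell.split(",")

df = pd.DataFrame({
    "Client_ID": ["C1019876", "C1018765", "C1017654", "C1016543", "C1019876"],
    "mat_deliv_1": ["xxx,yyy,zzz", "yyy,zzz", "yyy,xxx", "aaa,bbb", "yyy"],
    "mat_deliv_2": ["aaa,bbb,xxx", "xxx", "aaa,bbb", "ccc", None],
    "mat_deliv_3": ["xxx", "xxx", "ccc", None, None],
    "mat_deliv_4": ["ddd", None, "ddd", None, None],
})

deliv = df.filter(like="mat_deliv")

# One pass over the rows; each True counts as 1 in the sum.
df["xxx_count"] = [
    sum(has_token(cell, "xxx") for cell in row)
    for row in deliv.to_numpy()
]
print(df["xxx_count"].tolist())
```

Like the answer's version, this counts cells that contain the token, not repeated occurrences inside a single cell.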

edited Jan 8, 2019 at 20:55

answered Jan 8, 2019 at 20:05

jpp


2 Comments

cs95 Over a year ago

"then use a lambda within a lambda" ... cries internally

jpp Over a year ago

@coldspeed, Haha, I was going to update with a function too! ... Done.
