4

I have a dataframe called df that looks similar to this (except the number of 'mat_deliv' columns goes up to mat_deliv_8, there are several hundred clients and a number of other columns between Client_ID and mat_deliv_1 - I have simplified it here).

Client_ID  mat_deliv_1  mat_deliv_2  mat_deliv_3  mat_deliv_4
C1019876   xxx,yyy,zzz  aaa,bbb,xxx  xxx          ddd
C1018765   yyy,zzz      xxx          xxx          None
C1017654   yyy,xxx      aaa,bbb      ccc          ddd
C1016543   aaa,bbb      ccc          None         None
C1019876   yyy          None         None         None

I want to create a new column called xxx_count which counts the number of times xxx appears in mat_deliv_1, mat_deliv_2, mat_deliv_3 and mat_deliv_4. The values should look like this:

Client_ID  mat_deliv_1  mat_deliv_2  mat_deliv_3  mat_deliv_4  xxx_count
C1019876   xxx,yyy,zzz  aaa,bbb,xxx  xxx          ddd          3
C1018765   yyy,zzz      xxx          xxx          None         2
C1017654   yyy,xxx      aaa,bbb      ccc          ddd          1
C1016543   aaa,bbb      ccc          None         None         0
C1015432   yyy          None         None         None         0

I have tried the following code:

df = df.assign(xxx_count=df.loc[:, "mat_deliv_1":"mat_deliv_4"].\
               apply(lambda col: col.str.count('xxx')).fillna(0).astype(int))

But it does not produce a count, only a binary variable where 0 = no cases of xxx and 1 = the presence of xxx in at least one of the four mat_deliv columns.

NB: this is a follow-up question to that asked here: Creating a column based on the presence of part of a string in multiple other columns

0

3 Answers

3

Try joining the columns row-wise into a single string before counting:

df['counts'] = (df.loc[:, "mat_deliv_1":"mat_deliv_4"]
                  .fillna('')
                  .agg(','.join, 1)
                  .str.count('xxx'))
df
  Client_ID  mat_deliv_1  mat_deliv_2 mat_deliv_3 mat_deliv_4  counts
0  C1019876  xxx,yyy,zzz  aaa,bbb,xxx         xxx         ddd       3
1  C1018765      yyy,zzz          xxx         xxx         NaN       2
2  C1017654      yyy,xxx      aaa,bbb         ccc         ddd       1
3  C1016543      aaa,bbb          ccc         NaN         NaN       0
4  C1019876          yyy          NaN         NaN         NaN       0

This assumes "xxx" occurs at most once per cell. If it occurs more than once in a cell, each occurrence is counted.
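To make that caveat concrete, here is a minimal sketch on a made-up frame (the `demo` data and the `joined_count` name are assumptions for illustration, not taken from the question), in which one cell holds xxx twice:

```python
import pandas as pd

# Hypothetical two-row frame; the first cell contains "xxx" twice.
demo = pd.DataFrame({
    "mat_deliv_1": ["xxx,xxx", "yyy"],
    "mat_deliv_2": ["xxx", None],
})

# Join each row into one comma-separated string, then count the pattern.
joined_count = (demo.fillna("")
                    .agg(",".join, axis=1)
                    .str.count("xxx"))

print(joined_count.tolist())  # [3, 0]
```

The first row reports 3 even though only two cells mention xxx, because every match in the joined string is counted.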


Another option involves stack:

df['counts'] = (df.loc[:, "mat_deliv_1":"mat_deliv_4"]
                  .stack()
                  .str.count('xxx')
                  .groupby(level=0).sum())  # sum(level=0) was removed in pandas 2.0
df
  Client_ID  mat_deliv_1  mat_deliv_2 mat_deliv_3 mat_deliv_4  counts
0  C1019876  xxx,yyy,zzz  aaa,bbb,xxx         xxx         ddd       3
1  C1018765      yyy,zzz          xxx         xxx         NaN       2
2  C1017654      yyy,xxx      aaa,bbb         ccc         ddd       1
3  C1016543      aaa,bbb          ccc         NaN         NaN       0
4  C1019876          yyy          NaN         NaN         NaN       0

This can easily be modified to count at most one occurrence per cell, using str.contains:

df['counts'] = (df.loc[:, "mat_deliv_1":"mat_deliv_4"]
                  .stack()
                  .str.contains('xxx')
                  .groupby(level=0).sum())

If "xxx" can also appear as a substring of a longer value (for example "xxxx"), split on commas first and then compare whole tokens:

df['counts'] = (df.loc[:, "mat_deliv_1":"mat_deliv_4"]
                  .stack()
                  .str.split(',', expand=True)
                  .eq('xxx')
                  .any(axis=1)  # change to `.sum(axis=1)` to count all occurrences
                  .groupby(level=0).sum())
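
The difference between substring counting and whole-token comparison can be seen in a small sketch. The `demo` frame and the names `substring_hits` and `token_hits` are made up for illustration, and the explicit `dropna()` is an assumption added so the result does not depend on whether `stack()` drops missing cells in your pandas version:

```python
import pandas as pd

# Hypothetical data: "xxxx" contains "xxx" as a substring but is a different token.
demo = pd.DataFrame({
    "mat_deliv_1": ["xxx,xxxx", "yyy"],
    "mat_deliv_2": ["xxxx", None],
})

# One entry per cell, indexed by (row, column); drop missing cells explicitly.
cells = demo.stack().dropna()

# Pattern counting: every non-overlapping match of "xxx" is counted.
substring_hits = cells.str.count("xxx").groupby(level=0).sum()

# Token comparison: split on commas and count only exact "xxx" tokens.
token_hits = (cells.str.split(",", expand=True)
                   .eq("xxx")
                   .sum(axis=1)
                   .groupby(level=0).sum())

print(substring_hits.tolist())  # [3, 0]
print(token_hits.tolist())      # [1, 0]
```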

For performance, use a list comprehension:

df['counts'] = [
    ','.join(x).count('xxx') 
    for x in df.loc[:, "mat_deliv_1":"mat_deliv_4"].fillna('').values
]
df
  Client_ID  mat_deliv_1  mat_deliv_2 mat_deliv_3 mat_deliv_4  counts
0  C1019876  xxx,yyy,zzz  aaa,bbb,xxx         xxx         ddd       3
1  C1018765      yyy,zzz          xxx         xxx         NaN       2
2  C1017654      yyy,xxx      aaa,bbb         ccc         ddd       1
3  C1016543      aaa,bbb          ccc         NaN         NaN       0
4  C1019876          yyy          NaN         NaN         NaN       0

Why is a loop faster than using str methods or apply? See For loops with pandas - When should I care?.
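
If you want to check this on your own data, a rough sketch along these lines may help. The `demo` frame and both function names are assumptions for illustration; timings depend on the machine, the pandas version and the size of the frame, so no numbers are claimed here:

```python
import timeit
import pandas as pd

# Hypothetical small frame; swap in your own columns to get meaningful timings.
demo = pd.DataFrame({
    "mat_deliv_1": ["xxx,yyy,zzz", "yyy,zzz", "yyy,xxx"] * 1000,
    "mat_deliv_2": ["aaa,bbb,xxx", "xxx", None] * 1000,
})
cols = demo.loc[:, "mat_deliv_1":"mat_deliv_2"]

def with_str_methods():
    # Join per row with agg, then count with the vectorised .str accessor.
    return cols.fillna("").agg(",".join, axis=1).str.count("xxx").tolist()

def with_list_comp():
    # Plain Python loop over the underlying array of row values.
    return [",".join(row).count("xxx") for row in cols.fillna("").to_numpy()]

# Both approaches should agree before their speed is compared.
same_result = with_str_methods() == with_list_comp()
print(same_result)

print("str methods:", timeit.timeit(with_str_methods, number=10))
print("list comp:  ", timeit.timeit(with_list_comp, number=10))
```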

2 Comments

This worked perfectly - I went for the final suggestion, using list comprehension. Thanks very much for your help
A point to note with count is that it will include a matching substring within a larger string, e.g. xxx and xxxx will both count. If that is OK, then fine. If not, then one needs to test for equality, e.g. [sum(1 for word in ','.join(row).split(',') if word == 'xxx') for row in df.loc[:, "mat_deliv_1":"mat_deliv_4"].fillna('').values]
2

Using str.findall

df.iloc[:,1:].apply(lambda x : x.str.findall('xxx')).sum(1).str.len()
Out[433]: 
0    3
1    2
2    1
3    0
4    0
dtype: int64

1 Comment

Thanks for the response - however, I received an error message after running it on my df that .str could not be run on an object dtype, so I opted for the answer below
0

You can split each cell on commas, then use a lambda within a lambda to test for membership. The advantage of this solution is that you don't get incorrect results if xxx occurs as a substring of a longer value.

df['xxx_count'] = (df.filter(like='mat_deliv').fillna('')  # missing cells -> ''
                   .apply(lambda col: col.str.split(',')
                          .apply(lambda tokens: 'xxx' in tokens))
                   .sum(axis=1))

print(df)

  Client_ID  mat_deliv_1  mat_deliv_2 mat_deliv_3 mat_deliv_4  xxx_count
0  C1019876  xxx,yyy,zzz  aaa,bbb,xxx         xxx         ddd          3
1  C1018765      yyy,zzz          xxx         xxx        None          2
2  C1017654      yyy,xxx      aaa,bbb         ccc         ddd          1
3  C1016543      aaa,bbb          ccc        None        None          0
4  C1019876          yyy         None        None        None          0

Or, better, use a function:

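def sum_counts(series, value):
    # Flag each cell whose comma-separated tokens include `value` exactly.
    def finder(item, value):
        return value in item
    # fillna('') keeps the membership test from failing on missing cells.
    return series.fillna('').str.split(',').apply(finder, value=value)

df['xxx_count'] = df.filter(like='mat_deliv').apply(sum_counts, value='xxx').sum(axis=1)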
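
For reference, here is a self-contained sketch of the same idea run on the sample data from the question. The helper name `count_token_cells` and its `prefix` parameter are hypothetical, and the `fillna("")` step is an assumption that the missing cells are real None/NaN values rather than the string "None":

```python
import pandas as pd

# Sample data as posted in the question, with None for missing cells.
df = pd.DataFrame({
    "Client_ID": ["C1019876", "C1018765", "C1017654", "C1016543", "C1019876"],
    "mat_deliv_1": ["xxx,yyy,zzz", "yyy,zzz", "yyy,xxx", "aaa,bbb", "yyy"],
    "mat_deliv_2": ["aaa,bbb,xxx", "xxx", "aaa,bbb", "ccc", None],
    "mat_deliv_3": ["xxx", "xxx", "ccc", None, None],
    "mat_deliv_4": ["ddd", None, "ddd", None, None],
})

def count_token_cells(frame, token, prefix="mat_deliv"):
    """Count, per row, the cells whose comma-separated tokens include `token`."""
    cells = frame.filter(like=prefix).fillna("")
    # True where the cell, split on commas, contains the exact token.
    hits = cells.apply(lambda col: col.map(lambda cell: token in cell.split(",")))
    return hits.sum(axis=1)

df["xxx_count"] = count_token_cells(df, "xxx")
print(df["xxx_count"].tolist())  # [3, 2, 1, 0, 0]
```

Because the test is membership in the split list, a cell is counted at most once and a longer value such as xxxx is not treated as a match.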

2 Comments

"then use a lambda within a lambda" ... cries internally
@coldspeed, Haha, I was going to update with a function too! ... Done.
