Filter pandas (python) dataframe based on partial strings in a list

Question

I have a pandas DataFrame with 99 columns named dx1-dx99 and 99 columns named px1-px99. These columns contain codes of 4 to 8 alphanumeric characters.

I want to keep only those values whose first three characters match one of the three-character strings in a supplied list.

The supplied list is generated dynamically and is very long, so I have to pass the whole list rather than individual strings.

For example, I have this data frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': 'foo bar one123 bar foo one324 foo 0'.split(),
                   'B': 'one546 one765 twosde three twowef two234 onedfr three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})
print(df)

        A       B  C   D
0     foo  one546  0   0
1       0  one765  1   2
2  one123  twosde  2   4
3     bar   three  3   6
4     foo  twowef  4   8
5  one324  two234  5  10
6     foo  onedfr  6  12
7       0   three  7  14

The filled cells are of object type, and all zeros were originally NULL values that I filled with zeros using df.fillna(0).

When I do this:

keep = df.iloc[:,:].isin(['one123','one324','twosde','two234']).values
df.iloc[:,:] = df.iloc[:,:].where(keep, 0)
print(df)

I got this:

        A       B  C  D
0       0       0  0  0
1       0       0  0  0
2  one123  twosde  0  0
3       0       0  0  0
4       0       0  0  0
5  one324  two234  0  0
6       0       0  0  0
7       0       0  0  0

But instead of passing individual strings 'one123','one324','twosde','two234', I want to pass a list containing partial strings like this one:

startstrings = ['one', 'two']

keep = df.iloc[:,:].contains(startstrings)
df.iloc[:,:] = df.iloc[:,:].where(keep, 0)
print(df)

But the above does not work. I want to keep all values that start with 'one' or 'two'.

Any idea how to implement this? My data set is huge, so efficiency matters.

Would you have any other column by a name other than dx1-dx99 or px1-px99?
– Divakar, Commented Apr 3, 2017 at 16:52
Hey - don't have time to post a full answer, but have a look at this answer and the associated documentation for numpy.in1d stackoverflow.com/questions/19549634/… docs.scipy.org/doc/numpy/reference/generated/numpy.in1d.html
– Chuck, Commented Apr 3, 2017 at 16:57
Divakar, I subset the dataframe to just the dx1-dx99 and px1-px99 columns. I have many other columns, but those don't need this operation.
– Sanoj, Commented Apr 3, 2017 at 19:39

Kewl · Accepted Answer · 2017-04-03 19:14:32Z

3

The pandas str.contains method accepts regular expressions, which lets you test for any item in a list. Loop through each column and use str.contains:

startstrings = ['one', 'two']
pattern = '|'.join(startstrings)

for col in df:
    if all(df[col].apply(type) == str):
        # Set values to 0 if they don't contain any of the start strings
        # (.ix was removed in pandas 1.0; .loc is the label-based equivalent)
        df.loc[~df[col].str.contains(pattern), col] = 0
    else:
        # Column is not all strings
        df[col] = 0

Produces:

      A     B  C  D
0     0  one1  0  0
1     0  one1  0  0
2  one1  two1  0  0
3     0     0  0  0
4     0  two1  0  0
5  one1  two1  0  0
6     0  one1  0  0
7     0     0  0  0
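
Note that str.contains matches the pattern anywhere in a value, so a code such as 'xone9' would also be kept. Below is a minimal sketch of a prefix-only variant, assuming the codes are plain strings. It relies on Series.str.match, which only matches at the start of each value, and on re.escape to treat each prefix literally. The helper name keep_prefixed, its fill parameter, and the sample frame are illustrative and not part of the original answer.

```python
import re

import numpy as np
import pandas as pd


def keep_prefixed(df, startstrings, fill=0):
    """Return a copy of df where cells not starting with a prefix become fill.

    Hypothetical helper for illustration; non-string cells are also replaced.
    """
    # Escape each prefix so characters such as '.' are matched literally.
    pattern = "|".join(re.escape(s) for s in startstrings)
    mask = pd.DataFrame(False, index=df.index, columns=df.columns)
    for col in df.columns:
        # Compare on a string copy; str.match anchors at the start of the value.
        as_text = df[col].astype(str)
        # Only genuine str cells are allowed to match.
        is_text = df[col].map(lambda v: isinstance(v, str))
        mask[col] = as_text.str.match(pattern) & is_text
    return df.where(mask, fill)


df = pd.DataFrame({
    "A": ["foo", "bar", "one123", "xone9"],
    "B": ["one546", "twosde", "three", 0],
    "C": np.arange(4),
})
result = keep_prefixed(df, ["one", "two"])
print(result)
```

Building one boolean mask and applying it once with where keeps the original frame untouched, which makes it easier to compare input and output while debugging.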

edited Apr 3, 2017 at 19:14

answered Apr 3, 2017 at 16:30

Kewl


9 Comments

Sanoj Over a year ago

I have 99 Dx1-Dx99 and 99 Px1-Px99 columns, 198 columns in all, instead of only 'A' and 'B'. Writing out those column names, like df['A'].str.contains(pattern), is not feasible. Is there a way to apply these filters dynamically to the whole data frame, irrespective of columns? I can separate out the necessary columns in the data frame.

Kewl Over a year ago

Okay edited my answer, should work for arbitrary numbers of columns (also fixed it, before I thought you wanted the whole row, fixed that)

Sanoj Over a year ago

Any idea, why am I getting this error with my original dataset? All columns are either object or int64 types: TypeError: bad operand type for unary ~: 'float'

Kewl Over a year ago

My guess would be that some of the columns are of dtype object but are not all strings, which produces NaN when you try a str operation on them. I've changed the code above to make it more general and explicitly check whether all entries in the column are strings, instead of just checking whether the dtype is 'O'.

Sanoj Over a year ago

In my dataframe all columns are either object or int64. Although I didn't get error after your latest change. I was getting all ZEROs in my result set. Then I forced them to "df2.applymap(str)" but still getting all ZEROs.
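
The behaviour discussed in these comments can be reproduced on a small mixed column. This is a hedged sketch, assuming a column that holds both strings and the integer 0 left behind by fillna(0); the sample values are made up. Non-string cells come back from str.contains as NaN rather than False, which is what makes the unary ~ fail, and the na=False argument turns them into False.

```python
import pandas as pd

# A column as it might look after fillna(0): strings mixed with integer 0.
col = pd.Series(["one123", 0, "foo", 0])

# Without na=..., non-string cells yield NaN, so the result is not boolean.
raw = col.str.contains("one|two")
print(raw.tolist())

# na=False treats non-string cells as "no match" and gives a clean bool mask.
safe = col.str.contains("one|two", na=False)
print(safe.tolist())

# With a boolean mask, ~ and .where behave as expected.
filtered = col.where(safe, 0)
print(filtered.tolist())
```

With na=False the per-column "all strings" check is no longer needed for mixed columns, although purely numeric columns still have no .str accessor and need separate handling.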

Divakar · Accepted Answer · 2017-04-03 17:09:20Z

0

Here's a NumPy vectorized approach -

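import numpy as np

# Adapted from http://stackoverflow.com/a/39045337/3293881
def slicer_vectorized(a, start, end):
    # View each fixed-width string as single characters, one row per string.
    # 'U1' matches the unicode arrays that astype(str) produces on Python 3
    # (the original 'S1' view only works for byte strings).
    b = a.view('U1').reshape(len(a), -1)[:, start:end]
    # Copy the slice so it is contiguous, then rejoin the characters.
    return np.ascontiguousarray(b).view('U' + str(end - start)).ravel()

def isin_chars(df, startstrings, start=0, stop=3):
    a = df.values.astype(str)
    ss_arr = np.sort(startstrings)
    a_S3 = slicer_vectorized(a.ravel(), start, stop)
    idx = np.searchsorted(ss_arr, a_S3)
    # Values that sort after the last prefix get idx == len(ss_arr);
    # point them at a valid position so the lookup below cannot overflow.
    idx[idx == len(ss_arr)] = 0
    mask = (a_S3 == ss_arr[idx]).reshape(a.shape)
    return df.mask(~mask, 0)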

def process(df, startstrings, n = 100):
    dx_names = ['dx'+str(i) for i in range(1,n)]
    px_names = ['px'+str(i) for i in range(1,n)]
    all_names = np.hstack((dx_names, px_names))
    df0 = df[all_names]
    df_out = isin_chars(df0, startstrings, start=0, stop = 3)
    return df_out

Sample run -

In [245]: df
Out[245]: 
    dx1    dx2  px1  px2  0
0   foo   one1    0    0  0
1   bar   one1    1    2  7
2  one1   two1    2    4  3
3   bar  three    3    6  8
4   foo   two1    4    8  1
5  one1   two1    5   10  8
6   foo   one1    6   12  6
7   foo  three    7   14  6

In [246]: startstrings = ['two', 'one']

In [247]: process(df, startstrings, n = 3) # change n = 100 for actual case
Out[247]: 
    dx1   dx2  px1  px2
0     0  one1    0    0
1     0  one1    0    0
2  one1  two1    0    0
3     0     0    0    0
4     0  two1    0    0
5  one1  two1    0    0
6     0  one1    0    0
7     0     0    0    0
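
When every prefix in the list has the same length (three characters in the question), exact membership on a sliced copy avoids both regular expressions and low-level array views. The following is a minimal sketch under that assumption; the function name filter_by_prefix, its width and fill parameters, and the sample data are illustrative and not taken from the answer above.

```python
import numpy as np
import pandas as pd


def filter_by_prefix(df, startstrings, width=3, fill=0):
    """Keep cells whose first `width` characters are in startstrings.

    Illustrative sketch: assumes every prefix has exactly `width` characters.
    Cells are compared via their string form, so the integer 123 would match
    the prefix '123'.
    """
    # Slice the first `width` characters of every cell, column by column.
    heads = df.astype(str).apply(lambda s: s.str[:width])
    # DataFrame.isin performs an exact membership test per cell.
    mask = heads.isin(list(startstrings))
    return df.where(mask, fill)


df = pd.DataFrame({
    "dx1": ["foo", "bar", "one1", "one1"],
    "dx2": ["one1", "three", "two1", "on"],
    "px1": np.arange(4),
})
out = filter_by_prefix(df, ["two", "one"])
print(out)
```

Because the lookup is a set-style membership test, its cost should depend far less on the length of the prefix list than a long alternation pattern would, though this has not been benchmarked here.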

answered Apr 3, 2017 at 17:09

Divakar

3 Comments

Sanoj Over a year ago

I have Dx1-Dx99 & Px1-Px99. Therefore I took only the one line of code that tries to match the first three characters, and I am getting this error: ValueError: axis(=-1) out of bounds

Divakar Over a year ago

@Sanoj Edit dx_names = ['Dx'+str(i) for i in range(1,n)] and px_names = ['Px'+str(i) for i in range(1,n)] and see how it goes?

Sanoj Over a year ago

Sorry, I had them all in lower case: dx1-dx99. Sorry for the typo earlier. And I was still getting the error.

aghast · Accepted Answer · 2017-04-03 17:31:08Z

0

This is kind of brute-force-ish, but it allows for different-length prefix strings, as shown. I modified your example to look for ['one1', 'th'] to show the differing lengths. Not sure if that's something you need.

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': 'foo bar one1 bar foo one1 foo foo'.split(),
                   'B': 'one1 one1 two1 three two1 two1 one1 three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})

prefixes = "one1 th".split()

matches = np.full(df.shape, False, dtype=bool)

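for pfx in prefixes:
    for i, col in enumerate(df.columns):
        try:
            # na=False maps non-string cells to False, so the result stays
            # boolean and can be OR-ed into the bool array in place
            matches[:, i] |= df[col].str.startswith(pfx, na=False).values
        except AttributeError:
            # Some columns have no strings
            pass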

keep = df.where(matches, 0)
print(keep)

Running this, I get:

$ python test.py
      A      B  C  D
0     0   one1  0  0
1     0   one1  0  0
2  one1      0  0  0
3     0  three  0  0
4     0      0  0  0
5  one1      0  0  0
6     0   one1  0  0
7     0  three  0  0
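
Python's built-in str.startswith accepts a tuple of prefixes, which removes the outer loop over prefixes and avoids the in-place |= on a boolean array. Below is a sketch assuming object columns that may mix strings and numbers; starts_with_any is a hypothetical helper, and the data mirrors the example above.

```python
import numpy as np
import pandas as pd

prefixes = ("one1", "th")  # must be a tuple for str.startswith


def starts_with_any(value, prefixes=prefixes):
    """True only for str values beginning with one of the prefixes."""
    return isinstance(value, str) and value.startswith(prefixes)


df = pd.DataFrame({'A': 'foo bar one1 bar foo one1 foo foo'.split(),
                   'B': 'one1 one1 two1 three two1 two1 one1 three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})

# Build the boolean mask one column at a time; Series.map applies the
# Python-level check to every cell, so mixed-type columns are safe.
matches = df.apply(lambda col: col.map(starts_with_any))
keep = df.where(matches, 0)
print(keep)
```

This runs a Python function per cell, so it trades some speed for robustness; on a very large frame one of the vectorized approaches may be preferable.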

answered Apr 3, 2017 at 17:31

aghast

3 Comments

Sanoj Over a year ago

Getting this error:

<ipython-input-75-b28fd1fff9be> in <module>()
     44     for i,col in enumerate(df2.columns):
     45         try:
---> 46             matches[:,i] |= df2[col].str.startswith(pfx)
     47         except AttributeError as e:
     48             # Some columns have no strings

TypeError: ufunc 'bitwise_or' output (typecode 'O') could not be coerced to provided output parameter (typecode '?') according to the casting rule ''same_kind''

aghast Over a year ago

Obviously it works for me. Maybe try expanding the line: matches[:,i] = matches[:,i] | df[col].str.startswith(pfx) ? What versions of numpy/pandas are you on?

Sanoj Over a year ago

Python 3.5.1 |Anaconda 2.5.0 (64-bit)| (default, Jan 29 2016, 15:01:46) [MSC v.1900 64 bit (AMD64)]
