I have a pandas DataFrame with 99 columns named dx1-dx99 and 99 columns named px1-px99. These columns contain codes of 4 to 8 characters (letters and digits).

I want to keep only those values in these columns whose first three characters match one of the strings in a supplied list. Every string in the supplied list is exactly three characters long.

The supplied list is generated dynamically and is very long, so I have to pass the whole list rather than individual strings.

For example, I have this data frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': 'foo 0 one123 bar foo one324 foo 0'.split(),
                   'B': 'one546 one765 twosde three twowef two234 onedfr three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})
print(df)

        A       B  C   D
0     foo  one546  0   0
1       0  one765  1   2
2  one123  twosde  2   4
3     bar   three  3   6
4     foo  twowef  4   8
5  one324  two234  5  10
6     foo  onedfr  6  12
7       0   three  7  14

The string columns have object dtype, and all zeros were originally NULL values that I filled with zeros using df.fillna(0).

When I do this:

keep = df.iloc[:,:].isin(['one123','one324','twosde','two234']).values
df.iloc[:,:] = df.iloc[:,:].where(keep, 0)
print(df)

I got this:

        A       B  C  D
0       0       0  0  0
1       0       0  0  0
2  one123  twosde  0  0
3       0       0  0  0
4       0       0  0  0
5  one324  two234  0  0
6       0       0  0  0
7       0       0  0  0

But instead of passing individual strings 'one123','one324','twosde','two234', I want to pass a list containing partial strings like this one:

startstrings = ['one', 'two']

keep = df.iloc[:,:].contains(startstrings)
df.iloc[:,:] = df.iloc[:,:].where(keep, 0)
print(df)

But the above does not work. I want to keep all values that start with 'one' or 'two'.

Any idea how to implement this? My data set is huge, so efficiency matters.

  • Would you have any other column by a name other than dx1-dx99 or px1-px99? Commented Apr 3, 2017 at 16:52
  • Hey - don't have time to post a full answer, but have a look at this answer and the associated documentation for numpy.in1d stackoverflow.com/questions/19549634/… docs.scipy.org/doc/numpy/reference/generated/numpy.in1d.html Commented Apr 3, 2017 at 16:57
  • Divakar, I subset dataframe to have dx1-dx99 and px1-px99 columns. I have many but those don't need to be operated like these. Commented Apr 3, 2017 at 19:39

3 Answers

The pandas str.contains method accepts regular expressions, which lets you test for any item in a list. Loop through each column and use str.contains:

startstrings = ['one', 'two']
# Anchor with ^ so only values that start with a listed string match
pattern = '^(?:' + '|'.join(startstrings) + ')'

for col in df:
    if all(df[col].apply(type) == str):
        # Set values that do not match the pattern to 0
        df.loc[~df[col].str.contains(pattern), col] = 0
    else:
        # Column is not all strings
        df[col] = 0

Produces:

      A     B  C  D
0     0  one1  0  0
1     0  one1  0  0
2  one1  two1  0  0
3     0     0  0  0
4     0  two1  0  0
5  one1  two1  0  0
6     0  one1  0  0
7     0     0  0  0
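The loop above zeroes any column that is not made up entirely of strings. As a rough sketch of a variant that keeps matching strings in mixed columns too, assuming the helper name `keep_prefixed` (hypothetical, not from the answer) and assuming the non-numeric columns hold strings or filled zeros:

```python
import re
import numpy as np
import pandas as pd

def keep_prefixed(df, startstrings, fill=0):
    """Return a copy of df that keeps only string cells starting with a prefix."""
    # re.escape guards against regex metacharacters in the prefixes;
    # str.match only matches at the beginning of each string.
    pattern = '|'.join(re.escape(s) for s in startstrings)
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # Purely numeric columns cannot contain a prefixed code
            out[col] = fill
        else:
            s = out[col].astype(object)
            # na=False treats non-string entries (e.g. filled zeros) as non-matches
            keep = s.str.match(pattern, na=False)
            out[col] = s.where(keep, fill)
    return out

df = pd.DataFrame({'A': ['foo', 0, 'one123', 'bar'],
                   'B': ['one546', 'xone', 'twosde', 'three'],
                   'C': np.arange(4)})
result = keep_prefixed(df, ['one', 'two'])
print(result)
```

Because the helper works column by column, it can be called on just the dx/px subset of the frame without naming each column.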

9 Comments

I have 99 dx1-dx99 and 99 px1-px99 columns, 198 columns in all, instead of only 'A' and 'B'. Writing out each column name, like df['A'].str.contains(pattern), is not feasible. Is there a way to apply these filters dynamically to the whole data frame irrespective of the columns? I can separate out the necessary columns from the data frame beforehand.
Okay edited my answer, should work for arbitrary numbers of columns (also fixed it, before I thought you wanted the whole row, fixed that)
Any idea, why am I getting this error with my original dataset? All columns are either object or int64 types: TypeError: bad operand type for unary ~: 'float'
My guess would be that some of the columns have dtype object but contain values that are not strings, which turn into NaN when you try a str operation. I've changed the code above to make it more general: it now explicitly checks that all entries in the column are strings, instead of just checking whether the dtype is 'O'.
In my dataframe all columns are either object or int64. Although I didn't get error after your latest change. I was getting all ZEROs in my result set. Then I forced them to "df2.applymap(str)" but still getting all ZEROs.
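The TypeError discussed in these comments can probably be reproduced with a small sketch like the one below. The series `s` is a made-up example of an object column that mixes strings with a zero left behind by fillna(0); it is an assumption about the data, not the asker's actual frame.

```python
import pandas as pd

# Hypothetical mixed column: strings plus a numeric zero
s = pd.Series(['one123', 0, 'foo'], dtype=object)

# Non-string entries produce NaN, so the result is not a clean boolean mask
raw = s.str.contains('one')

# na=False replaces those NaN results with False, giving a usable mask
safe = s.str.contains('one', na=False)

print(raw.tolist())
print(safe.tolist())
```

A mask built with na=False can be negated with ~ without hitting the float NaN.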

Here's a NumPy vectorized approach -

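import numpy as np

# From http://stackoverflow.com/a/39045337/3293881
def slicer_vectorized(a, start, end):
    # View each fixed-width byte string as single bytes, then slice columns
    b = a.view('S1').reshape(len(a), -1)[:, start:end]
    return np.frombuffer(b.tobytes(), dtype='S' + str(end - start))

def isin_chars(df, startstrings, start=0, stop=3):
    # Byte strings keep the 'S1' view at one element per character on Python 3
    a = df.values.astype('S')
    ss_arr = np.sort(np.asarray(startstrings, dtype='S'))
    a_S3 = slicer_vectorized(a.ravel(), start, stop)
    # Clip so values sorting after the last prefix do not index out of bounds
    idx = np.searchsorted(ss_arr, a_S3).clip(max=len(ss_arr) - 1)
    mask = (a_S3 == ss_arr[idx]).reshape(a.shape)
    return df.mask(~mask, 0)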

def process(df, startstrings, n = 100):
    dx_names = ['dx'+str(i) for i in range(1,n)]
    px_names = ['px'+str(i) for i in range(1,n)]
    all_names = np.hstack((dx_names, px_names))
    df0 = df[all_names]
    df_out = isin_chars(df0, startstrings, start=0, stop = 3)
    return df_out

Sample run -

In [245]: df
Out[245]: 
    dx1    dx2  px1  px2  0
0   foo   one1    0    0  0
1   bar   one1    1    2  7
2  one1   two1    2    4  3
3   bar  three    3    6  8
4   foo   two1    4    8  1
5  one1   two1    5   10  8
6   foo   one1    6   12  6
7   foo  three    7   14  6

In [246]: startstrings = ['two', 'one']

In [247]: process(df, startstrings, n = 3) # change n = 100 for actual case
Out[247]: 
    dx1   dx2  px1  px2
0     0  one1    0    0
1     0  one1    0    0
2  one1  two1    0    0
3     0     0    0    0
4     0  two1    0    0
5  one1  two1    0    0
6     0  one1    0    0
7     0     0    0    0
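For comparison, here is a shorter sketch of the same fixed-width idea that avoids byte views. It relies on NumPy truncating strings when casting to a narrower unicode dtype and on np.isin for the membership test. The name `prefix_mask` and the tiny frame are illustrative assumptions only.

```python
import numpy as np
import pandas as pd

def prefix_mask(df, startstrings, width=3):
    """Boolean array marking cells whose first `width` characters are in startstrings."""
    # Casting to a fixed-width unicode dtype truncates every value to `width` characters
    heads = df.to_numpy().astype(str).astype('U' + str(width))
    return np.isin(heads, list(startstrings))

df = pd.DataFrame({'dx1': ['foo', 'one1', 0],
                   'px1': ['two1', 'three', 'one9']}, dtype=object)
mask = prefix_mask(df, ['one', 'two'])
filtered = df.where(mask, 0)
print(filtered)
```

Note that numeric cells are converted to text first, so a number such as 123456 would match a prefix of '123'. Whether that is acceptable depends on the data.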

3 Comments

I have Dx1-Dx99 & Px1-Px99. Therefore I took only one line of code where it is trying to match first three characters and I am getting this error:: ValueError: axis(=-1) out of bounds
@Sanoj Edit dx_names = ['Dx'+str(i) for i in range(1,n)] and px_names = ['Px'+str(i) for i in range(1,n)] and see how it goes?
Sorry, I had them all in lowercase: dx1-dx99. Sorry for the typo earlier. And I was getting the error.

This is kind of brute-force-ish, but it allows for different-length prefix strings, as shown. I modified your example to look for ['one1', 'th'] to show the differing lengths. Not sure if that's something you need.

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': 'foo bar one1 bar foo one1 foo foo'.split(),
                   'B': 'one1 one1 two1 three two1 two1 one1 three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})

prefixes = "one1 th".split()

matches = np.full(df.shape, False, dtype=bool)

for pfx in prefixes:
    for i,col in enumerate(df.columns):
        try:
            # na=False keeps the result boolean when a column mixes strings and numbers
            matches[:, i] |= df[col].str.startswith(pfx, na=False).to_numpy()
        except AttributeError:
            # Some columns have no strings
            pass

keep = df.where(matches, 0)
print(keep)

Running this, I get:

$ python test.py
      A      B  C  D
0     0   one1  0  0
1     0   one1  0  0
2  one1      0  0  0
3     0  three  0  0
4     0      0  0  0
5  one1      0  0  0
6     0   one1  0  0
7     0  three  0  0
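Another way to handle prefixes of different lengths, sketched here under the assumption that a plain Python-level check per cell is fast enough, is to pass a tuple of prefixes to the built-in str.startswith. The helper `starts_with_any` and the small frame are hypothetical.

```python
import pandas as pd

# Python's str.startswith accepts a tuple of alternatives
prefixes = ('one1', 'th')

def starts_with_any(value):
    # Non-string cells (ints, filled zeros) never match
    return isinstance(value, str) and value.startswith(prefixes)

df = pd.DataFrame({'A': ['foo', 'one1', 'bar'],
                   'B': ['two1', 'three', 'one1'],
                   'C': [0, 1, 2]})

# Build the boolean mask column by column, then blank out non-matching cells
matches = df.apply(lambda col: col.map(starts_with_any))
# astype(object) lets string columns hold the integer fill value
keep = df.astype(object).where(matches, 0)
print(keep)
```

This removes the outer loop over prefixes, although each cell is still visited once in Python, so it is unlikely to beat the vectorized string methods on very large frames.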

3 Comments

Getting this error: <ipython-input-75-b28fd1fff9be> in <module>() 44 for i,col in enumerate(df2.columns): 45 try: ---> 46 matches[:,i] |= df2[col].str.startswith(pfx) 47 except AttributeError as e: 48 # Some columns have no strings TypeError: ufunc 'bitwise_or' output (typecode 'O') could not be coerced to provided output parameter (typecode '?') according to the casting rule ''same_kind''
Obviously it works for me. Maybe try expanding the line: matches[:,i] = matches[:,i] | df[col].str.startswith(pfx) ? What versions of numpy/pandas are you on?
Python 3.5.1 |Anaconda 2.5.0 (64-bit)| (default, Jan 29 2016, 15:01:46) [MSC v.1900 64 bit (AMD64)]
