Python Pandas removing substring using another column

Question

I've tried searching around and can't figure out an easy way to do this, so I'm hoping your expertise can help.

I have a pandas data frame with two columns

import numpy as np
import pandas as pd

pd.options.display.width = 1000
testing = pd.DataFrame({'NAME':[
    'FIRST', np.nan, 'NAME2', 'NAME3', 
    'NAME4', 'NAME5', 'NAME6'], 'FULL_NAME':['FIRST LAST', np.nan, 'FIRST LAST', 'FIRST NAME3', 'FIRST NAME4 LAST', 'ANOTHER NAME', 'LAST NAME']})

which gives me

          FULL_NAME   NAME
0        FIRST LAST  FIRST
1               NaN    NaN
2        FIRST LAST  NAME2
3       FIRST NAME3  NAME3
4  FIRST NAME4 LAST  NAME4
5      ANOTHER NAME  NAME5
6         LAST NAME  NAME6

What I'd like to do is take the values from the 'NAME' column and remove them from the 'FULL_NAME' column where they are present. So the function would then return

          FULL_NAME   NAME           NEW
0        FIRST LAST  FIRST          LAST
1               NaN    NaN           NaN
2        FIRST LAST  NAME2    FIRST LAST
3       FIRST NAME3  NAME3         FIRST
4  FIRST NAME4 LAST  NAME4    FIRST LAST
5      ANOTHER NAME  NAME5  ANOTHER NAME
6         LAST NAME  NAME6     LAST NAME

So far, I've defined the function below and am using the apply method. This runs rather slowly on my large data set, though, and I'm hoping there's a more efficient way to do it. Thanks!

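import re

def address_remove(x):
    try:
        # Treat NAME as a regex pattern and remove it from FULL_NAME
        new_addr = re.sub(x['NAME'], '', x['FULL_NAME'])
        # Trim the whitespace left at either end
        return new_addr.strip()
    except (TypeError, re.error):
        # NaN (a float) or an invalid pattern: return the value unchanged
        return x['FULL_NAME']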
johnchase · Accepted Answer · 2016-01-13 19:06:24Z

10

Here is one solution that is quite a bit faster than your current one, though I'm not convinced there isn't something faster still.

In [13]: import numpy as np
         import pandas as pd
         n = 1000
         testing  = pd.DataFrame({'NAME':[
         'FIRST', np.nan, 'NAME2', 'NAME3', 
         'NAME4', 'NAME5', 'NAME6']*n, 'FULL_NAME':['FIRST LAST', np.nan, 'FIRST  LAST', 'FIRST NAME3', 'FIRST NAME4 LAST', 'ANOTHER NAME', 'LAST NAME']*n})

This is kind of a long one-liner, but it should do what you need.

The fastest solution I can come up with uses replace, as mentioned in another answer:

In [37]: %timeit testing ['NEW2'] = [e.replace(k, '') for e, k in zip(testing.FULL_NAME.astype('str'), testing.NAME.astype('str'))]
100 loops, best of 3: 4.67 ms per loop

Original answer:

In [14]: %timeit testing ['NEW'] = [''.join(str(e).split(k)) for e, k in zip(testing.FULL_NAME.astype('str'), testing.NAME.astype('str'))]
100 loops, best of 3: 7.24 ms per loop

compared to your current solution:

In [16]: %timeit testing['NEW1'] = testing.apply(address_remove, axis=1)
10 loops, best of 3: 166 ms per loop

These get you the same answer as your current solution
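
One caveat worth checking on your own data: the one-liners above do not strip the whitespace left behind, and astype('str') can turn NaN into the string 'nan'. Below is a minimal sketch of a variant that keeps NaN and tidies the spacing. The helper name remove_name is hypothetical, and collapsing repeated spaces is an assumption about the desired output (it matches the NEW column shown in the question).

```python
import numpy as np
import pandas as pd

testing = pd.DataFrame({
    'NAME': ['FIRST', np.nan, 'NAME2', 'NAME3', 'NAME4', 'NAME5', 'NAME6'],
    'FULL_NAME': ['FIRST LAST', np.nan, 'FIRST LAST', 'FIRST NAME3',
                  'FIRST NAME4 LAST', 'ANOTHER NAME', 'LAST NAME'],
})

def remove_name(full_name, name):
    # Hypothetical helper: leave non-string values (such as NaN) untouched.
    if not isinstance(full_name, str) or not isinstance(name, str):
        return full_name
    # Plain substring removal, then collapse the whitespace left behind.
    return ' '.join(full_name.replace(name, '').split())

testing['NEW'] = [remove_name(e, k)
                  for e, k in zip(testing['FULL_NAME'], testing['NAME'])]
print(testing)
```

This is likely a little slower than the bare replace one-liner because of the extra checks, so it is worth timing on real data.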

edited Jan 13, 2016 at 19:06

answered Jan 13, 2016 at 18:58


2 Comments

Link Over a year ago

great! I was trying to come up with the 2nd solution, but the third one is even better! Would you mind telling me what the "zip" command is doing though?

johnchase Over a year ago

Glad that worked! zip takes multiple iterables and returns an iterator of the aggregate from the original iterables. In more lay terms it allows you to loop through two or more iterables simultaneously. docs.python.org/3/library/functions.html#zip
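
To make the explanation of zip concrete, here is a small illustration. The two short lists are made up for the example and stand in for the FULL_NAME and NAME columns.

```python
full_names = ['FIRST LAST', 'FIRST NAME3']
names = ['FIRST', 'NAME3']

# zip pairs the items up position by position.
pairs = list(zip(full_names, names))
print(pairs)  # [('FIRST LAST', 'FIRST'), ('FIRST NAME3', 'NAME3')]

# Looping over both lists at once, much like the list comprehension in the answer
# (with a strip added here to drop the leftover space).
cleaned = [full.replace(name, '').strip() for full, name in zip(full_names, names)]
print(cleaned)  # ['LAST', 'FIRST']
```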

Anton Protopopov · Accepted Answer · 2016-01-13 20:13:14Z

6

You could do it with the replace method and its regex argument, and then use str.strip:

In [605]: testing.FULL_NAME.replace(testing.NAME[testing.NAME.notnull()], '', regex = True).str.strip()
Out[605]: 
0            LAST
1             NaN
2      FIRST LAST
3           FIRST
4     FIRST  LAST
5    ANOTHER NAME
6       LAST NAME
Name: FULL_NAME, dtype: object

Note: you need to filter testing.NAME with notnull(), because otherwise NaN values will also be replaced with an empty string.

In benchmarks this is slower than @johnchase's fastest solution, but I think it's more readable and it uses only pandas methods of DataFrames and Series:

In [607]: %timeit testing['NEW'] = testing.FULL_NAME.replace(testing.NAME[testing.NAME.notnull()], '', regex = True).str.strip()
100 loops, best of 3: 4.56 ms per loop

In [661]: %timeit testing ['NEW'] = [e.replace(k, '') for e, k in zip(testing.FULL_NAME.astype('str'), testing.NAME.astype('str'))]
1000 loops, best of 3: 450 µs per loop
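
One thing to keep in mind with any regex-based removal: the name is interpreted as a pattern, so characters such as parentheses or dots have special meaning. The sketch below uses plain re.sub rather than pandas to show the effect, and the sample values are made up for illustration. If your names may contain such characters, escaping them with re.escape (or using plain str.replace) is one way to get literal matching.

```python
import re

full_name = 'A.B. SMITH (JR)'   # made-up sample value
name = '(JR)'                   # made-up sample value

# Unescaped, the parentheses form a regex group, so only 'JR' is matched
# and the literal parentheses are left behind.
unescaped = re.sub(name, '', full_name).strip()
print(unescaped)  # A.B. SMITH ()

# re.escape makes the pattern match the text literally.
escaped = re.sub(re.escape(name), '', full_name).strip()
print(escaped)  # A.B. SMITH
```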

edited Jan 13, 2016 at 20:13

answered Jan 13, 2016 at 19:32


4 Comments

floydn Over a year ago

pure pandas solution. good work. definitely easier to read, even if it wasn't faster.

Anton Protopopov Over a year ago

@johnchase yes, sorry. It's for less typing in console

johnchase Over a year ago

Yep, I did the exact same thing at first too. Also what size is the dataframe for your test? I'm getting pretty different timing results running your code, though I'm wondering if it's something I'm doing...

Anton Protopopov Over a year ago

@johnchase Yes, your solution is almost 10 times faster. I have a more powerful PC :)

Dan · Accepted Answer · 2016-01-13 19:02:02Z

I think you want to use the replace() method that strings have. It's considerably faster than using regular expressions (about 19 times faster in a quick IPython check):

%timeit mystr.replace("ello", "")
The slowest run took 7.64 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 250 ns per loop

%timeit re.sub("ello","", "e")
The slowest run took 21.03 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 4.7 µs per loop
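
For anyone who wants to repeat the comparison outside IPython, here is a minimal sketch using the standard timeit module. The value of mystr is an assumption, since the snippet above does not show it, and both calls are run on the same string so the comparison is like for like. The numbers will vary by machine.

```python
import re
import timeit

mystr = "hello world"  # assumed value; the original snippet does not show it

# Both approaches should produce the same result on a literal pattern.
assert mystr.replace("ello", "") == re.sub("ello", "", mystr) == "h world"

n = 100_000
t_replace = timeit.timeit(lambda: mystr.replace("ello", ""), number=n)
t_resub = timeit.timeit(lambda: re.sub("ello", "", mystr), number=n)

print(f"str.replace: {t_replace / n * 1e9:.0f} ns per call")
print(f"re.sub:      {t_resub / n * 1e9:.0f} ns per call")
```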

If you need further speed improvements after that, you should look into numpy's vectorize function (but I think the speed up from using replace instead of regular expressions should be pretty substantial).
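
A rough sketch of what the np.vectorize route could look like is below; the sample arrays are made up for the example. Note that the NumPy documentation describes vectorize as provided primarily for convenience rather than performance (it is essentially a loop), so it is worth timing it against the list comprehension before relying on it.

```python
import numpy as np

full_names = np.array(['FIRST LAST', 'FIRST NAME3', 'LAST NAME'], dtype=object)
names = np.array(['FIRST', 'NAME3', 'NAME6'], dtype=object)

# np.vectorize wraps a Python function so it can be called on whole arrays.
# otypes=[object] keeps the results as ordinary Python strings.
remove = np.vectorize(lambda full, name: full.replace(name, '').strip(),
                      otypes=[object])

result = remove(full_names, names)
print(result.tolist())  # ['LAST', 'FIRST', 'LAST NAME']
```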
