Remove partial string from dataframe with Pandas

Question

If I have a dataframe like this:

id    str
01    abc_d(a)
02    ab_d(a)
03    abcd_e(a)
04    a_b(a)

How can I get a dataframe like the following? Sorry, I made up this dataframe to represent my real issue. Thanks.

id    str
01    d
02    d
03    e
04    b

I try, but I can only accept one answer. When I click another, the previous one turns gray...
– ah bon, Commented Jun 7, 2018 at 9:09

cs95 · Accepted Answer · 2018-06-07 02:06:53Z

4

(Bad Answer)

`Series.str.split` soup

df['str'] = df['str'].str.split('(').str[0].str.split('_').str[-1]    
df

   id str
0   1   d
1   2   d
2   3   e
3   4   b

(Less Bad answer)

`Series.str.extract`

df['str'] = df['str'].str.extract(r'_([^_]+)\(', expand=False)
df

   id str
0   1   d
1   2   d
2   3   e
3   4   b

Regex methods come with their fair share of overhead, and str.extract does not do much to make things better.
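As a small illustration of what `expand=False` does in the call above: with a single capture group it returns a Series, and rows that do not match come back as NaN, whereas `expand=True` returns a one-column DataFrame. The frame below is rebuilt from the question; the extra `no_match` row is a made-up value, added only to show the NaN case.

```python
import pandas as pd

# Sample data from the question, plus one hypothetical non-matching row.
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'str': ['abc_d(a)', 'ab_d(a)', 'abcd_e(a)', 'a_b(a)', 'no_match'],
})
pattern = r'_([^_]+)\('

# One capture group + expand=False -> Series
as_series = df['str'].str.extract(pattern, expand=False)
# Same pattern + expand=True -> DataFrame with a single column
as_frame = df['str'].str.extract(pattern, expand=True)

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(as_series.isna().tolist())  # which rows failed to match
```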

(Better Answer)

`re.search` with list comp

import re

p = re.compile(r'(?<=_)[^_]+(?=\()')
df['str'] = [p.search(x)[0] for x in df['str'].tolist()] 
df

   id str
0   1   d
1   2   d
2   3   e
3   4   b

This should be faster than the above methods. I find list comprehensions are really fast compared to most vectorised string pandas methods, even if this does use regex. I pre-compile the pattern in advance to alleviate some of the performance concerns.
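One caveat with the list comprehension: `p.search(x)` returns `None` when a string does not match, so indexing with `[0]` would raise a `TypeError`. A minimal sketch of a guarded version follows; the helper name `token_or_none` and the `plain` row are hypothetical, introduced only for illustration.

```python
import re
import pandas as pd

# Sample data from the question, plus one hypothetical non-matching row.
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'str': ['abc_d(a)', 'ab_d(a)', 'abcd_e(a)', 'a_b(a)', 'plain'],
})
p = re.compile(r'(?<=_)[^_]+(?=\()')

def token_or_none(text):
    """Return the text between '_' and '(' or None if there is no match."""
    match = p.search(text)
    return match[0] if match else None

tokens = [token_or_none(x) for x in df['str'].tolist()]
df['token'] = tokens
print(tokens)
```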

(Also a better answer)

`str.split` with list comp

df['str'] = [
    x.split('(', 1)[0].split('_')[1] for x in df['str'].tolist()
]
df

   id str
0   1   d
1   2   d
2   3   e
3   4   b

This combines the best of both worlds, the performance of a list comp and the speed of pure python string splitting. Should be the fastest.
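Note that `split('_')[1]` takes the second piece, which agrees with the `[-1]` used in the other snippets only when the string contains exactly one underscore. A hedged sketch of a variant using `str.rsplit('_', 1)` is shown below; this is a swapped-in choice, not the code from the answer, and the `a_b_c(x)` value is made up to show the difference.

```python
# First value is from the question; the second is a hypothetical
# string with two underscores.
samples = ['abc_d(a)', 'a_b_c(x)']

# Answer's approach: second piece after splitting on every underscore.
second_piece = [x.split('(', 1)[0].split('_')[1] for x in samples]

# Variant: split once from the right, keep the text after the last underscore.
last_piece = [x.split('(', 1)[0].rsplit('_', 1)[-1] for x in samples]

print(second_piece)  # ['d', 'b']
print(last_piece)    # ['d', 'c']
```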

Performance

df_test = pd.concat([df] * 10000, ignore_index=True)

%timeit df_test['str'].str.extract(r'_([^_]+)\(', expand=False)
%timeit df_test['str'].str.split('(').str[0].str.split('_').str[-1] 
%timeit [p.search(x)[0] for x in df_test['str'].tolist()] 
%timeit [x.split('(', 1)[0].split('_')[1] for x in df_test['str'].tolist()]

70.4 ms ± 623 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
99.6 ms ± 730 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
31 ms ± 877 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
30 ms ± 431 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  # fastest but not by much
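`%timeit` is an IPython magic, so the comparison above will not run in a plain script. A rough equivalent with the standard-library `timeit` module might look like this; the frame is rebuilt from the question, the repeat counts are arbitrary, and the numbers will differ by machine and pandas version.

```python
import re
import timeit
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'str': ['abc_d(a)', 'ab_d(a)', 'abcd_e(a)', 'a_b(a)'],
})
# Smaller than the 10000 copies used above, to keep the script quick.
df_test = pd.concat([df] * 1000, ignore_index=True)
p = re.compile(r'(?<=_)[^_]+(?=\()')

candidates = {
    'str.extract': lambda: df_test['str'].str.extract(r'_([^_]+)\(', expand=False),
    'str.split chain': lambda: df_test['str'].str.split('(').str[0].str.split('_').str[-1],
    're.search list comp': lambda: [p.search(x)[0] for x in df_test['str'].tolist()],
    'str.split list comp': lambda: [x.split('(', 1)[0].split('_')[1] for x in df_test['str'].tolist()],
}

# Sanity check: every approach should produce the same values.
results = {name: list(func()) for name, func in candidates.items()}

for name, func in candidates.items():
    # Best of 3 repeats, 5 calls each, reported per call.
    seconds = min(timeit.repeat(func, number=5, repeat=3)) / 5
    print(f'{name}: {seconds * 1000:.2f} ms per call')
```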

edited Jun 7, 2018 at 2:06

answered Jun 7, 2018 at 1:35

cs95


7 Comments

sudonym Over a year ago

Again, coldspeed! If you have time, could you shortly include an explanation of what makes the answers good, bad or better if they convey the same result? I am a beginner and don't see the immediate difference in terms of quality, with the exception that your best answer seems to be the worst in terms of readability to me (no offence).

cs95 Over a year ago

@sudonym yes, give me a minute, I'll add some more info ;-)

sudonym Over a year ago

as usual, I am impressed - enjoy your day!

jpp Over a year ago

I'm not sure why you'd choose str.split('(').str[0].str.split('_').str[-1] over str.split('_').str[-1].str[0]. This doesn't invalidate the rest of your answer, but there's no need to make a splitting approach seem worse than it is.

cs95 Over a year ago

@jpp not great for readability, but possibly better in terms of fewer splits, depending on your string.

BENY · Accepted Answer · 2018-06-07 01:24:41Z

3

Using extract

df['str'] = df['str'].str.extract(r"_(.*)\(", expand=True)  # raw string; '_' needs no escape
df
Out[585]: 
   id str
0   1   d
1   2   d
2   3   e
3   4   b

answered Jun 7, 2018 at 1:24

BENY


niraj · Accepted Answer · 2018-06-07 01:27:25Z

1

Maybe you can try split, for example:

df['str'] = df['str'].str.split('_').str.get(1).str[0]

Or,

df['str'] = df['str'].str.split('_').str.get(1).str.split('(').str[0]

answered Jun 7, 2018 at 1:27

niraj


jpp · Accepted Answer · 2018-06-07 01:27:29Z

1

Using pd.Series.str.split. Specific to your particular format.

df['str'] = df['str'].str.split('_').str[-1].str[0]

print(df)

   id str
0   1   d
1   2   d
2   3   e
3   4   b
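"Specific to your particular format" matters here because `.str[0]` keeps only the first character after the last underscore. A short sketch of how that differs from splitting on `(` when the wanted token is longer than one character; the `abc_de(a)` value is hypothetical and not from the question.

```python
import pandas as pd

# First value is from the question; the second is a made-up longer token.
s = pd.Series(['abc_d(a)', 'abc_de(a)'])

# Takes only the first character after the last underscore.
first_char = s.str.split('_').str[-1].str[0]

# Takes everything after the last underscore up to the first '('.
up_to_paren = s.str.split('_').str[-1].str.split('(').str[0]

print(first_char.tolist())   # ['d', 'd']
print(up_to_paren.tolist())  # ['d', 'de']
```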

answered Jun 7, 2018 at 1:27

jpp
