3

If I have a dataframe like this:

id    str
01    abc_d(a)
02    ab_d(a)
03    abcd_e(a)
04    a_b(a)

How can I get a dataframe like the following? Sorry, I made up this dataframe to represent my real issue. Thanks.

id    str
01    d
02    d
03    e
04    b
1
  • I tried, but I can only accept one answer. When I click another, the previous one turns gray... Commented Jun 7, 2018 at 9:09

4 Answers

4

(Bad Answer)

Series.str.split soup

df['str'] = df['str'].str.split('(').str[0].str.split('_').str[-1]    
df

   id str
0   1   d
1   2   d
2   3   e
3   4   b

(Less Bad answer)

Series.str.extract

df['str'] = df['str'].str.extract(r'_([^_]+)\(', expand=False)
df

   id str
0   1   d
1   2   d
2   3   e
3   4   b

Regex methods come with their fair share of overhead, and str.extract does not do much to make things better.
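
For readers unsure what the pattern does, here is a minimal sketch on a small Series. The sample values, including the unmatched 'no_parens', are made up for illustration. With a single capture group, expand=False returns a Series, and rows that do not match become NaN.

```python
import pandas as pd

# Hypothetical sample; the last value has no "(" and so cannot match.
s = pd.Series(['abc_d(a)', 'ab_d(a)', 'no_parens'])

# _        literal underscore
# ([^_]+)  capture one or more non-underscore characters
# \(       literal opening parenthesis
result = s.str.extract(r'_([^_]+)\(', expand=False)

print(result.tolist())
```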


(Better Answer)

re.search with list comp

import re

p = re.compile(r'(?<=_)[^_]+(?=\()')
df['str'] = [p.search(x)[0] for x in df['str'].tolist()] 
df

   id str
0   1   d
1   2   d
2   3   e
3   4   b

This should be faster than the methods above. I find list comprehensions really fast compared to most of pandas' vectorised string methods, even though this one uses regex. I pre-compile the pattern in advance to alleviate some of the performance concerns.
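
One caveat with the list comprehension: p.search(x) returns None when a string does not match, so indexing the result raises TypeError. A minimal sketch of a guarded variant follows. The helper name extract_token and the sample list are hypothetical, not part of the original answer.

```python
import re

# Same lookbehind/lookahead pattern as above, compiled once.
p = re.compile(r'(?<=_)[^_]+(?=\()')

def extract_token(value, default=None):
    """Return the matched token, or `default` for non-strings and non-matches."""
    if not isinstance(value, str):
        return default
    m = p.search(value)
    return m.group(0) if m else default

# Hypothetical input: the four sample strings plus two problem values.
values = ['abc_d(a)', 'ab_d(a)', 'abcd_e(a)', 'a_b(a)', 'no_match', None]
tokens = [extract_token(v) for v in values]
print(tokens)
```

With a DataFrame, this could be used as df['str'] = [extract_token(x) for x in df['str'].tolist()].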


(Also a better answer)

str.split with list comp

df['str'] = [
    x.split('(', 1)[0].split('_')[1] for x in df['str'].tolist()
]
df

   id str
0   1   d
1   2   d
2   3   e
3   4   b

This combines the best of both worlds: the performance of a list comp and the speed of pure Python string splitting. It should be the fastest.
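
Note that this version indexes with [1], while the Series.str.split version earlier uses [-1]. On the sample data they agree, but they differ when a string contains more than one underscore. A small sketch with a made-up input ('a_b_c(x)' is hypothetical, as are the two function names):

```python
def second_piece(x):
    # As in the list comprehension above: the piece after the first underscore.
    return x.split('(', 1)[0].split('_')[1]

def last_piece(x):
    # As in the Series.str chain: the piece after the last underscore.
    return x.split('(', 1)[0].split('_')[-1]

sample = ['abc_d(a)', 'a_b_c(x)']
print([second_piece(x) for x in sample])
print([last_piece(x) for x in sample])
```

Which index is right depends on what the real strings look like, so it is worth checking against a few real rows.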


Performance

import pandas as pd

df_test = pd.concat([df] * 10000, ignore_index=True)

%timeit df_test['str'].str.extract(r'_([^_]+)\(', expand=False)
%timeit df_test['str'].str.split('(').str[0].str.split('_').str[-1] 
%timeit [p.search(x)[0] for x in df_test['str'].tolist()] 
%timeit [x.split('(', 1)[0].split('_')[1] for x in df_test['str'].tolist()]

70.4 ms ± 623 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
99.6 ms ± 730 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
31 ms ± 877 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
30 ms ± 431 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  # fastest but not by much
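
%timeit is an IPython magic. Outside IPython, a rough equivalent for the two list-comprehension variants can be sketched with the standard-library timeit module. The list data below is a stand-in for df_test['str'].tolist(), and the function names are hypothetical. Absolute numbers will vary by machine and are not comparable to the figures above.

```python
import re
import timeit

p = re.compile(r'(?<=_)[^_]+(?=\()')

# Stand-in for df_test['str'].tolist(): the four sample strings repeated.
data = ['abc_d(a)', 'ab_d(a)', 'abcd_e(a)', 'a_b(a)'] * 10000

def with_regex():
    return [p.search(x).group(0) for x in data]

def with_split():
    return [x.split('(', 1)[0].split('_')[1] for x in data]

# Both approaches should agree before their timings are compared.
assert with_regex() == with_split()

for fn in (with_regex, with_split):
    # Best of 3 repeats, 10 calls each, reported per call in milliseconds.
    best = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f'{fn.__name__}: {best * 1e3:.1f} ms per call')
```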

7 Comments

Again, coldspeed! If you have time, could you briefly explain what makes the answers good, bad, or better when they all give the same result? I am a beginner and don't see an immediate difference in quality, except that your best answer seems to be the worst in terms of readability to me (no offence).
@sudonym yes, give me a minute, I'll add some more info ;-)
as usual, I am impressed - enjoy your day!
I'm not sure why you'd choose str.split('(').str[0].str.split('_').str[-1] over str.split('_').str[-1].str[0]. This doesn't invalidate the rest of your answer, but there's no need to make a splitting approach seem worse than it is.
@jpp not great for readability, but possibly better in terms of fewer splits, depending on your string.
3

Using extract

df['str'] = df['str'].str.extract(r"_(.*)\(", expand=False)
df
Out[585]: 
   id str
0   1   d
1   2   d
2   3   e
3   4   b

1

Maybe you can try split, as in this example:

df['str'] = df['str'].str.split('_').str.get(1).str[0]

Or,

df['str'] = df['str'].str.split('_').str.get(1).str.split('(').str[0]

1

Using pd.Series.str.split. Specific to your particular format.

df['str'] = df['str'].str.split('_').str[-1].str[0]

print(df)

   id str
0   1   d
1   2   d
2   3   e
3   4   b
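
"Specific to your particular format" matters here: the trailing .str[0] keeps only the first character after the last underscore. A short sketch with a hypothetical multi-character token ('ab_xyz(a)' is made up) shows how this differs from splitting on '(':

```python
import pandas as pd

s = pd.Series(['abc_d(a)', 'ab_xyz(a)'])

# First character after the last underscore.
first_char = s.str.split('_').str[-1].str[0]

# Everything between the last underscore and the first '('.
full_token = s.str.split('(').str[0].str.split('_').str[-1]

print(first_char.tolist())
print(full_token.tolist())
```

If the wanted token is always a single character, the shorter chain is enough; otherwise the extra split on '(' is needed.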
