
How do I slice a string in a DataFrame column, starting from the left, at one of several characters, such as ' / - . , ? I only want the text before the first time such a character shows up.

key   name
1   McDonald's
2   CVS/PHARMACY
3   CVS/Store
4   WAL-MART
5   AMAZON.CO

Expected result:

key   name            for_Group
1   McDonald's        McDonald
2   CVS/PHARMACY         CVS
3   CVS/Store            CVS
4   WAL-MART             WAL
5   AMAZON.CO          AMAZON

I'm not sure whether this needs a regular expression.

2 Answers


Option 1
str.split with expand=True

df['for_group'] = df.name.str.split(r"[\'\/\-\.]", expand=True)[0]

   key          name for_group
0    1    McDonald's  McDonald
1    2  CVS/PHARMACY       CVS
2    3     CVS/Store       CVS
3    4      WAL-MART       WAL
4    5     AMAZON.CO    AMAZON

Option 2 (Best option)
str.extract (I personally prefer this one; the lazy group matches only up to the first of your desired stop characters)

df.name.str.extract(r'(.*?)[\'\/\-\.]', expand=False)

0    McDonald
1         CVS
2         CVS
3         WAL
4      AMAZON

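The second option is also about twice as fast in this benchmark (the extract timing below was taken with the earlier greedy pattern (.*)):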

df = pd.concat([df]*10000)

%timeit df.name.str.split(r"[\'\/\-\.]", expand=True)[0]
141 ms ± 1.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df.name.str.extract(r'(.*)[\'\/\-\.]', expand=False)
72.6 ms ± 397 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
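
One caveat worth checking on real data: str.extract yields NaN for a name that contains none of the delimiters, whereas str.split(...)[0] returns the whole name. Below is a minimal sketch of one way to make extract behave like split in that case. The extra row "TARGET" and the prefix pattern are assumptions for illustration, not part of the question or the answer above.

```python
import pandas as pd

# Sample data from the question plus one hypothetical row ("TARGET")
# that contains none of the delimiters.
df = pd.DataFrame({
    "key": [1, 2, 3, 4, 5, 6],
    "name": ["McDonald's", "CVS/PHARMACY", "CVS/Store",
             "WAL-MART", "AMAZON.CO", "TARGET"],
})

# Lazy pattern from the answer: it requires a delimiter to be present,
# so the "TARGET" row has no match and comes back as NaN.
lazy = df["name"].str.extract(r"(.*?)['/\-.]", expand=False)

# Alternative (an assumption, not from the answer): capture the leading
# run of non-delimiter characters, which also matches when no delimiter exists.
prefix = df["name"].str.extract(r"^([^'/\-.]*)", expand=False)

print(lazy.tolist())
print(prefix.tolist())
```

If names without a delimiter should stay empty instead, the lazy pattern from the answer already does that.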

4 Comments

Be careful with str.extract: when I tested it on a 300k-row dataset, its accuracy was poor. In contrast, str.split was 100% accurate.
@Learn I'm guessing you were using (.*)[\'\/\-\.] as the regex with extract. I updated it to (.*?)[\'\/\-\.] which should fix bad matches. I would recommend trying the current version in my answer and seeing if that works better for you!
After your update, both of them hit 100% accuracy. Bingo!
A bit of explanation: (.*)[\'\/\-\.] matches from the beginning of your string up to the last delimiter, but you want lazy matching. Using (.*?) instead of (.*) matches only up to the first occurrence of a delimiter.
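
To see the difference described in the comments, here is a small sketch using Python's re module with the same two patterns. The name "CVS/PHARMACY-01" is a hypothetical value with two delimiters, chosen because the question's sample rows each contain only one and so would not expose the problem.

```python
import re

# Hypothetical name containing two delimiters ("/" and "-");
# it is not part of the question's data.
name = "CVS/PHARMACY-01"

# Greedy group: consumes as much as possible, then backtracks
# until a delimiter follows, so it stops at the LAST delimiter.
greedy = re.search(r"(.*)['/\-.]", name).group(1)

# Lazy group: consumes as little as possible, so it stops at the FIRST delimiter.
lazy = re.search(r"(.*?)['/\-.]", name).group(1)

print(greedy)  # CVS/PHARMACY
print(lazy)    # CVS
```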

Method 1

You can use the regular expression below, which matches a word character (a letter, digit, or underscore) repeated one or more times. re.findall returns a list, and you can take the first element from it.

import re
df['for_group'] = df['name'].apply(lambda x: re.findall(r"[\w]+", x)[0])

A faster approach is to use re.search(), as pointed out by @user3483203:

df['for_group'] = df['name'].apply(lambda x: re.search(r"[\w]+", x).group())
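
Note that re.search returns None when a name contains no word characters at all, so calling .group() directly would raise an AttributeError for such a row. A minimal sketch of a guarded helper follows. The function name first_word and the inputs "---" and "7-ELEVEN" are hypothetical, added only for illustration.

```python
import re

# Compile once so the pattern is not re-parsed for every row.
WORD = re.compile(r"\w+")

def first_word(name):
    """Return the first run of word characters, or None if there is none."""
    match = WORD.search(name)
    return match.group() if match else None

print(first_word("WAL-MART"))  # WAL
print(first_word("---"))       # None
print(first_word("7-ELEVEN"))  # 7
```

It could then be applied as df['name'].apply(first_word). Keep in mind that \w also matches digits and underscores, so this splits on any non-word character, not only on the delimiters listed in the question.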

Method 2

Similarly, you can use:

df['for_group'] = df.name.str.split(r'\W+').str[0]

Output:

   key          name for_group
0    1    McDonald's  McDonald
1    2  CVS/PHARMACY       CVS
2    3     CVS/Store       CVS
3    4      WAL-MART       WAL
4    5     AMAZON.CO    AMAZON

8 Comments

Your first method can be made considerably faster by using search as opposed to findall, so df['name'].apply(lambda x: re.search(r"[\w]+", x).group())
@user3483203, you are absolutely right. For a large dataframe, .findall() takes 73.8 ms ± 1.22 ms while .search() takes 60.1 ms ± 552 µs. Thanks.
On this dataset it doesn't matter much, since there will only be a couple of results, but with many results the gap would widen even more. It is the same reason extract is better than split in my answer: it returns after a single match.
@user3483203, oh I see. Then is it safe to say that if we had long sentences to split instead of strings of a couple of words, we would be better off using .split() and/or .findall()?
Sorry, I was unclear. search and extract return as soon as the first match is found, while split and findall scan the entire string. If the strings in the column were longer, search/extract would be even more desirable.