slice a string by different characters using Python Pandas

Question

How can I slice a string in a DataFrame column, starting from the left, at any of several different characters such as ' / - . ? I only care about the first time one of these characters shows up.

key   name
1   McDonald's
2   CVS/PHARMACY
3   CVS/Store
4   WAL-MART
5   AMAZON.CO

Expected result:

key   name            for_Group
1   McDonald's        McDonald
2   CVS/PHARMACY         CVS
3   CVS/Store            CVS
4   WAL-MART             WAL
5   AMAZON.CO          AMAZON

I'm not sure whether this needs a regular expression.

user3483203 · Accepted Answer · 2018-06-15 21:39:00Z

4

Option 1
str.split with expand=True

df['for_group'] = df.name.str.split(r"[\'\/\-\.]", expand=True)[0]

   key          name for_group
0    1    McDonald's  McDonald
1    2  CVS/PHARMACY       CVS
2    3     CVS/Store       CVS
3    4      WAL-MART       WAL
4    5     AMAZON.CO    AMAZON
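Since only the text before the first delimiter is needed, the split can be capped at one per row. A minimal sketch, assuming the sample data from the question and using the `n` parameter of `str.split`; no timing is claimed for it here:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "key": [1, 2, 3, 4, 5],
    "name": ["McDonald's", "CVS/PHARMACY", "CVS/Store", "WAL-MART", "AMAZON.CO"],
})

# n=1 limits each row to a single split, so at most two pieces are produced.
pieces = df["name"].str.split(r"['/\-.]", n=1)

# .str[0] picks the piece before the first delimiter.
df["for_group"] = pieces.str[0]

print(df)
```

Inside a character class only the hyphen needs escaping, so `['/\-.]` matches the same characters as `[\'\/\-\.]`.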

Option 2 (Best option)
str.extract (I personally prefer this one: it matches only up to the first of your desired stop characters)

df.name.str.extract(r'(.*?)[\'\/\-\.]', expand=False)

0    McDonald
1         CVS
2         CVS
3         WAL
4      AMAZON
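One caveat worth checking on real data: this pattern requires a delimiter, so a name without one yields NaN. The sketch below assumes a hypothetical extra row, "TARGET", which is not in the question's data, and shows two possible ways to keep such names whole:

```python
import pandas as pd

# "TARGET" is a hypothetical row with no delimiter, added for illustration.
names = pd.Series(["McDonald's", "CVS/PHARMACY", "TARGET"], name="name")

# Lazy pattern from the answer: no delimiter means no match, hence NaN.
lazy = names.str.extract(r"(.*?)['/\-.]", expand=False)

# Alternative 1: fall back to the original name where nothing matched.
lazy_filled = lazy.fillna(names)

# Alternative 2: capture the leading run of non-delimiter characters,
# which matches whether or not a delimiter follows.
negated = names.str.extract(r"^([^'/\-.]*)", expand=False)

print(pd.DataFrame({"lazy": lazy, "lazy_filled": lazy_filled, "negated": negated}))
```

Both alternatives give the same result on this sample; which one reads better is a matter of taste.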

The second option is roughly twice as fast on this data:

df = pd.concat([df]*10000)

%timeit df.name.str.split(r"[\'\/\-\.]", expand=True)[0]
141 ms ± 1.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df.name.str.extract(r'(.*)[\'\/\-\.]', expand=False)
72.6 ms ± 397 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

edited Jun 15, 2018 at 21:39

answered Jun 15, 2018 at 21:22

user3483203


4 Comments

Learn Over a year ago

Be careful with str.extract: when I tested it on a dataset of 300k rows, its accuracy was poor. In contrast, str.split was 100% accurate.

user3483203 Over a year ago

@Learn I'm guessing you were using (.*)[\'\/\-\.] as the regex with extract. I updated it to (.*?)[\'\/\-\.] which should fix bad matches. I would recommend trying the current version in my answer and seeing if that works better for you!

Learn Over a year ago

After your update, both of them hit 100% accuracy. Bingo!

user3483203 Over a year ago

A bit of explanation: (.*)[\'\/\-\.] matches from the beginning of your string up to the last delimiter. You want lazy matching instead, and using (.*?) rather than (.*) makes it match only up to the first occurrence of a delimiter.
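The difference only shows up when a string contains more than one delimiter, which none of the question's sample rows do. A small sketch with the standard `re` module, assuming a hypothetical name "CVS/PHARMACY-01" that is not in the original data:

```python
import re

DELIMS = r"['/\-.]"
greedy = re.compile(r"(.*)" + DELIMS)   # backtracks from the end: stops at the LAST delimiter
lazy = re.compile(r"(.*?)" + DELIMS)    # expands from the start: stops at the FIRST delimiter

# "CVS/PHARMACY-01" is a hypothetical value with two delimiters.
samples = ["AMAZON.CO", "CVS/PHARMACY-01"]

results = {}
for s in samples:
    results[s] = (greedy.search(s).group(1), lazy.search(s).group(1))
    print(s, "-> greedy:", results[s][0], "| lazy:", results[s][1])
```

With a single delimiter both patterns agree, which would explain why the greedy version looked correct on the five sample rows but not on a larger dataset.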

harpan · Accepted Answer · 2018-06-15 21:44:14Z

2

Method 1

You can use the regular expression below, which matches a word character (letters, digits, or underscore) repeated one or more times. re.findall returns a list of all matches, and you take its first element.

import re
df['for_group'] = df['name'].apply(lambda x: re.findall(r"[\w]+", x)[0])

A faster approach to regular expression would be to use .search() as pointed out by @user3483203

df['for_group'] = df['name'].apply(lambda x: re.search(r"[\w]+", x).group())

Method 2

Similarly, you can use:

df['for_group'] = df.name.str.split(r'\W+').apply(lambda x: x[0])

Output:

    key          name for_group
0    1    McDonald's  McDonald
1    2  CVS/PHARMACY       CVS
2    3     CVS/Store       CVS
3    4      WAL-MART       WAL
4    5     AMAZON.CO    AMAZON
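Note that `\w+` stops at any non-word character, not only at the four delimiters listed in the question. The sketch below assumes a hypothetical name, "AT&T", to show where the two approaches can diverge; it uses `str.extract` in place of `apply` purely for brevity:

```python
import pandas as pd

# "AT&T" is a hypothetical value, not part of the question's data.
names = pd.Series(["McDonald's", "WAL-MART", "AT&T"])

# First run of word characters: "&" ends the match.
by_word = names.str.extract(r"(\w+)", expand=False)

# Leading run of anything except the question's delimiters: "&" is kept.
by_delims = names.str.extract(r"^([^'/\-.]*)", expand=False)

print(pd.DataFrame({"name": names, "by_word": by_word, "by_delims": by_delims}))
```

Whether "AT" or "AT&T" is the desired group depends on the data, so it may be worth checking a sample of real names before choosing.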

edited Jun 15, 2018 at 21:44

answered Jun 15, 2018 at 21:23

harpan


8 Comments

user3483203 Over a year ago

Your first method can be made considerably faster by using search as opposed to findall, so df['name'].apply(lambda x: re.search(r"[\w]+", x).group())

harpan Over a year ago

@user3483203, you are absolutely right. For large dataframe, .findall() takes 73.8 ms ± 1.22 ms time while .search() takes 60.1 ms ± 552 µs. Thanks.

user3483203 Over a year ago

On this dataset, since there will only be a couple results it doesn't matter, but if there were many results it would skew even more. Same reason extract is better than split in my answer. It returns after a single match

harpan Over a year ago

@user3483203, oh I see. Then is it safe to say that if we had long sentences to split instead of strings of a couple of words, we would be better off using .split() and/or .findall()?

user3483203 Over a year ago

Sorry, I was unclear. search and extract return as soon as the first match is found, while split and findall scan the entire string. If the strings in the column were longer, search/extract would be even more desirable.
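To make the distinction concrete, here is a small sketch with the standard `re` module. The helper name `first_word` is hypothetical, and the `None` guard is an addition for strings that contain no word characters at all, a case the original snippets would fail on:

```python
import re

WORD = re.compile(r"\w+")

def first_word(text):
    """Return the first run of word characters, or None if there is none."""
    match = WORD.search(text)          # stops at the first match
    return match.group() if match else None

all_words = WORD.findall("CVS/PHARMACY")   # scans the whole string
print(all_words)                            # every match, not just the first
print(first_word("CVS/PHARMACY"))
print(first_word("---"))                    # no word characters: None instead of an error
```

It could be applied to a column with `df['name'].apply(first_word)`, in the same way as the lambda versions in the answer.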
