python pandas partial string match

Question

I created a dataframe df where I have a column with the following values:

category
20150115_Holiday_HK_Misc
20150115_Holiday_SG_Misc
20140116_DE_ProductFocus
20140116_UK_ProductFocus

I want to create 3 new columns:

category                 | A                     | B  | C
20150115_Holiday_HK_Misc | 20150115_Holiday_Misc | HK | Holiday_Misc
20150115_Holiday_SG_Misc | 20150115_Holiday_Misc | SG | Holiday_Misc
20140116_DE_ProductFocus | 20140116_ProductFocus | DE | ProductFocus
20140116_UK_ProductFocus | 20140116_ProductFocus | UK | ProductFocus

In column A, I want to take out the country code (e.g. "_HK"). I think I need to hard-code this, which is fine, since I have the list of all country codes.

Column B is that country code.

Column C is column A without the date at the beginning.

I am trying something like this, but not getting far.

 df['B'] = np.where([df['category'].str.contains("HK")==True], 'HK', 'Not Specified')

Thank you

I'm thinking about some string methods like .split(), for example.
– AsheKetchum, Commented Feb 24, 2017 at 19:37
Except your strings aren't all structured the same way, so it doesn't get you exactly where you want to be.
– AsheKetchum, Commented Feb 24, 2017 at 19:39
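
To illustrate the .split() idea from the comments: splitting alone cannot locate the country code because it sits at a different position in each string, but it can work when combined with a lookup against the known codes. The sketch below is only one possible approach; `country_codes` and `split_category` are hypothetical names, and the set of codes is assumed from the sample data.

```python
import pandas as pd

df = pd.DataFrame({"category": [
    "20150115_Holiday_HK_Misc",
    "20150115_Holiday_SG_Misc",
    "20140116_DE_ProductFocus",
    "20140116_UK_ProductFocus",
]})

# Hypothetical set: substitute the real list of country codes.
country_codes = {"HK", "SG", "DE", "UK"}

def split_category(value):
    # The first token is the date; the rest may contain a country code anywhere.
    date, *rest = value.split("_")
    # First token that is a known country code, or None if there is none.
    code = next((token for token in rest if token in country_codes), None)
    remainder = "_".join(token for token in rest if token != code)
    return pd.Series({"A": f"{date}_{remainder}", "B": code, "C": remainder})

# apply() returns a DataFrame with columns A, B, C; join it on the index.
df = df.join(df["category"].apply(split_category))
```

This is row-by-row Python, so it is likely slower than the vectorized string methods on large frames.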

MaxU - stand with Ukraine · Accepted Answer · 2017-02-24 19:52:36Z

5

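You can use the Series.str.replace() and Series.str.extract() methods:

# remove the two-character country code surrounded by '_'
# (regex=True is required in pandas 2.0+, where str.replace defaults to a literal match)
df['A'] = df.category.str.replace(r'_\w{2}_', '_', regex=True)
# extract the two-character country code surrounded by '_'
df['B'] = df.category.str.extract(r'_(\w{2})_', expand=False)
# drop the leading date from column A
df['C'] = df.A.str.extract(r'\d+_(.*)', expand=False)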

Result:

In [148]: df
Out[148]:
                   category                      A   B             C
0  20150115_Holiday_HK_Misc  20150115_Holiday_Misc  HK  Holiday_Misc
1  20150115_Holiday_SG_Misc  20150115_Holiday_Misc  SG  Holiday_Misc
2  20140116_DE_ProductFocus  20140116_ProductFocus  DE  ProductFocus
3  20140116_UK_ProductFocus  20140116_ProductFocus  UK  ProductFocus
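
Since the question mentions that a list of all country codes is available, the pattern could also be built from that list instead of matching any two word characters. This is a sketch under that assumption, not part of the original answer; `country_codes` and `code_pattern` are hypothetical names, and the list here is taken from the sample data only.

```python
import re
import pandas as pd

df = pd.DataFrame({"category": [
    "20150115_Holiday_HK_Misc",
    "20150115_Holiday_SG_Misc",
    "20140116_DE_ProductFocus",
    "20140116_UK_ProductFocus",
]})

# Hypothetical list: substitute the real list of country codes.
country_codes = ["HK", "SG", "DE", "UK"]

# Builds '_(HK|SG|DE|UK)(?=_)': a known code preceded by '_' and followed by '_'.
# The lookahead leaves the trailing '_' in place, so the match can simply be deleted.
code_pattern = "_(" + "|".join(map(re.escape, country_codes)) + ")(?=_)"

df["A"] = df["category"].str.replace(code_pattern, "", regex=True)
df["B"] = df["category"].str.extract(code_pattern, expand=False)
# Strip the leading run of digits and the underscore that follows it.
df["C"] = df["A"].str.replace(r"^\d+_", "", regex=True)
```

Restricting the match to known codes avoids accidentally treating some other two-character token between underscores as a country code.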

edited Feb 24, 2017 at 19:52

answered Feb 24, 2017 at 19:37


2 Comments

Vaishali Over a year ago

Using column A.extract for C is quite smart. Makes the regex much more readable.

AsheKetchum Over a year ago

It looks beautiful :)

Yuval Atzmon · Accepted Answer · 2017-02-24 19:50:13Z

1

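You can also use re.sub with apply:

import re
# drop the two-character country code, keeping the text on either side
df['A'] = df.category.apply(lambda x: re.sub(r'(.*)_(\w\w)_(.*)', r'\1_\3', x))
# keep only the two-character country code
df['B'] = df.category.apply(lambda x: re.sub(r'(.*)_(\w\w)_(.*)', r'\2', x))
# drop the leading date from column A
df['C'] = df.A.apply(lambda x: re.sub(r'(\d+)_(.*)', r'\2', x))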

Result

                   category                      A   B             C
0  20150115_Holiday_HK_Misc  20150115_Holiday_Misc  HK  Holiday_Misc
1  20150115_Holiday_SG_Misc  20150115_Holiday_Misc  SG  Holiday_Misc
2  20140116_DE_ProductFocus  20140116_ProductFocus  DE  ProductFocus
3  20140116_UK_ProductFocus  20140116_ProductFocus  UK  ProductFocus
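
One caveat with the re.sub approach: when the pattern does not match, re.sub returns the input unchanged, so a row without a country code would end up with the whole string in column B. A possible variation, sketched below, uses re.search with a fallback value. The helper name `country_code` and the extra row without a code are hypothetical, the pattern is narrowed to two uppercase letters as an assumption about the codes, and the 'Not Specified' default is borrowed from the question's own attempt.

```python
import re
import pandas as pd

df = pd.DataFrame({"category": [
    "20150115_Holiday_HK_Misc",
    "20140116_DE_ProductFocus",
    "20150115_Holiday_Misc",  # hypothetical row without a country code
]})

# Assumes country codes are exactly two uppercase letters between underscores.
CODE_RE = re.compile(r"_([A-Z]{2})_")

def country_code(value, default="Not Specified"):
    # re.search returns None when nothing matches, so the fallback is explicit.
    match = CODE_RE.search(value)
    return match.group(1) if match else default

df["B"] = df["category"].apply(country_code)
```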

answered Feb 24, 2017 at 19:50
