I created a dataframe df where I have a column with the following values:

category
20150115_Holiday_HK_Misc
20150115_Holiday_SG_Misc
20140116_DE_ProductFocus
20140116_UK_ProductFocus

I want to create 3 new columns:

category                  |           A              |  B  |       C     
20150115_Holiday_HK_Misc     20150115_Holiday_Misc     HK    Holiday_Misc 
20150115_Holiday_SG_Misc     20150115_Holiday_Misc     SG    Holiday_Misc
20140116_DE_ProductFocus     20140116_ProductFocus     DE    ProductFocus
20140116_UK_ProductFocus     20140116_ProductFocus     UK    ProductFocus

In column A, I want to remove the country code (e.g. "_HK"). I think I need to hard-code this, which is fine, since I have a list of all the country codes.

Column B is that country code.

Column C is column A without the date at the beginning.

I am trying something like this, but not getting far.

 df['B'] = np.where([df['category'].str.contains("HK")==True], 'HK', 'Not Specified')

Thank you

  • I'm thinking about some string methods like .split() for example Commented Feb 24, 2017 at 19:37
  • Except your strings aren't all structured the same way, so it doesn't get you exactly where you want to be. Commented Feb 24, 2017 at 19:39

2 Answers

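You can use the Series.str.replace() and Series.str.extract() methods:

# remove the two-character country code surrounded by '_'
# (regex=True is needed in pandas 2.0+, where str.replace matches literally by default)
df['A'] = df.category.str.replace(r'_\w{2}_', '_', regex=True)
# extract the two-character country code surrounded by '_'
df['B'] = df.category.str.extract(r'_(\w{2})_', expand=False)
# drop the leading date from A
df['C'] = df.A.str.extract(r'\d+_(.*)', expand=False)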

Result:

In [148]: df
Out[148]:
                   category                      A   B             C
0  20150115_Holiday_HK_Misc  20150115_Holiday_Misc  HK  Holiday_Misc
1  20150115_Holiday_SG_Misc  20150115_Holiday_Misc  SG  Holiday_Misc
2  20140116_DE_ProductFocus  20140116_ProductFocus  DE  ProductFocus
3  20140116_UK_ProductFocus  20140116_ProductFocus  UK  ProductFocus
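
Since the question mentions having a list of all country codes, the pattern can also be built from that list, so that other two-character tokens between underscores are not mistaken for a country. This is only a sketch: the `country_codes` list below is a hypothetical stand-in for the real list, and the "Not Specified" fallback is borrowed from the attempt in the question.

```python
import re
import pandas as pd

# Hypothetical list of codes; replace it with your own full list.
country_codes = ["HK", "SG", "DE", "UK"]

df = pd.DataFrame({"category": [
    "20150115_Holiday_HK_Misc",
    "20150115_Holiday_SG_Misc",
    "20140116_DE_ProductFocus",
    "20140116_UK_ProductFocus",
]})

# Alternation of the known codes, e.g. "HK|SG|DE|UK"
codes = "|".join(map(re.escape, country_codes))
# Match "_<code>" only when it is followed by another underscore
pattern = rf"_({codes})(?=_)"

# A: category with the "_<code>" token removed
df["A"] = df["category"].str.replace(pattern, "", regex=True)
# B: the code itself, or a fallback label when no known code is present
df["B"] = df["category"].str.extract(pattern, expand=False).fillna("Not Specified")
# C: A without the leading date
df["C"] = df["A"].str.replace(r"^\d+_", "", regex=True)
```

Because of the `(?=_)` lookahead, a code at the very end of the string (with no trailing underscore) is not matched; adjust the pattern if your data contains such values.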

2 Comments

Using column A.extract for C is quite smart. Makes the regex much more readable.
It looks beautiful :)

You can also use re.sub with apply:

import re
df['A'] = df.category.apply(lambda x:re.sub(r'(.*)_(\w\w)_(.*)', r'\1_\3', x))
df['B'] = df.category.apply(lambda x:re.sub(r'(.*)_(\w\w)_(.*)', r'\2', x))
df['C'] = df.A.apply(lambda x:re.sub(r'(\d+)_(.*)', r'\2', x))

Result

                   category                      A   B             C
0  20150115_Holiday_HK_Misc  20150115_Holiday_Misc  HK  Holiday_Misc
1  20150115_Holiday_SG_Misc  20150115_Holiday_Misc  SG  Holiday_Misc
2  20140116_DE_ProductFocus  20140116_ProductFocus  DE  ProductFocus
3  20140116_UK_ProductFocus  20140116_ProductFocus  UK  ProductFocus
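
One caveat with the `re.sub` approach: when a row has no country code, `re.sub` returns the input unchanged, so column B would contain the whole category string. A possible variant, assuming you would rather get a fallback label, uses `re.search` instead. The helper name `country_code`, the `[A-Za-z]{2}` pattern, and the second sample row are illustrative assumptions, not part of the original answer.

```python
import re
import pandas as pd

# Two letters between underscores; compiled once and reused for every row
CODE_RE = re.compile(r"_([A-Za-z]{2})_")

def country_code(value, default="Not Specified"):
    """Return the two-letter token between underscores, or `default` if none is found."""
    match = CODE_RE.search(value)
    return match.group(1) if match else default

df = pd.DataFrame({"category": [
    "20150115_Holiday_HK_Misc",
    "20140116_ProductFocus",  # hypothetical row without a country code
]})

df["B"] = df["category"].apply(country_code)
```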
