I want to extract the year from my DataFrame column data3['CopyRight'].

CopyRight
2015 Sony Music Entertainment
2015 Ultra Records , LLC under exclusive license
2014 , 2015 Epic Records , a division of Sony Music Entertainment
Compilation ( P ) 2014 Epic Records , a division of Sony Music Entertainment
2014 , 2015 Epic Records , a division of Sony Music Entertainment
2014 , 2015 Epic Records , a division of Sony Music Entertainment

I am using the code below to extract the year:

data3['CopyRight_year'] = data3['CopyRight'].str.extract('([0-9]+)', expand=False).str.strip()

With my code, I only get the first occurrence of a year in each row.

CopyRight_year
2015
2015
2014
2014
2014
2014

I want to extract all the years mentioned in the column.

Expected Output

CopyRight_year
    2015
    2015
    2014,2015
    2014
    2014,2015
    2014,2015

2 Answers

Use findall with a regex that matches every 4-digit integer, which returns a list per row, and then join each list with a separator:

Thanks to @Wiktor Stribiżew for the idea of adding word boundaries, r'\b\d{4}\b':

data3['CopyRight_year'] = data3['CopyRight'].str.findall(r'\b\d{4}\b').str.join(',')
print (data3)
                                           CopyRight CopyRight_year
0                      2015 Sony Music Entertainment           2015
1   2015 Ultra Records , LLC under exclusive license           2015
2  2014 , 2015 Epic Records , a division of Sony ...      2014,2015
3  Compilation ( P ) 2014 Epic Records , a divisi...           2014
4  2014 , 2015 Epic Records , a division of Sony ...      2014,2015
5  2014 , 2015 Epic Records , a division of Sony ...      2014,2015
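For anyone who wants to reproduce this without the original data, here is a minimal self-contained sketch. The DataFrame is rebuilt by hand from the first four rows shown in the question, so treat its contents as an assumption rather than the asker's real data.

```python
import pandas as pd

# Sample rows copied from the question; the real data3 is assumed to look like this.
data3 = pd.DataFrame({
    "CopyRight": [
        "2015 Sony Music Entertainment",
        "2015 Ultra Records , LLC under exclusive license",
        "2014 , 2015 Epic Records , a division of Sony Music Entertainment",
        "Compilation ( P ) 2014 Epic Records , a division of Sony Music Entertainment",
    ]
})

# findall returns a list of every standalone 4-digit number in each row.
years = data3["CopyRight"].str.findall(r"\b\d{4}\b")

# join collapses each list into a single comma-separated string.
data3["CopyRight_year"] = years.str.join(",")

print(data3["CopyRight_year"].tolist())
# ['2015', '2015', '2014,2015', '2014']
```

Keeping the intermediate `years` Series around can be handy if the list form is needed later, for example to count the years per row with `years.str.len()`.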
2 Comments

I would use r'\b\d{4}\b', since '(\d{4})' will match 4-digit chunks even inside longer digit chunks (e.g. 0067 in 006789).
@jezrael - Thanks a lot I am getting the expected output.
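The word-boundary point from the first comment can be checked with a small sketch. The sample string below is made up for illustration and is not from the asker's data.

```python
import pandas as pd

# Hypothetical row containing a 6-digit catalog number next to a real year.
s = pd.Series(["Catalog 006789 released 2015"])

# Without word boundaries, the first four digits of 006789 are picked up too.
without_boundary = s.str.findall(r"\d{4}").tolist()

# With \b on both sides, only standalone 4-digit numbers match.
with_boundary = s.str.findall(r"\b\d{4}\b").tolist()

print(without_boundary)  # [['0067', '2015']]
print(with_boundary)     # [['2015']]
```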
Your current regex captures only the first run of digits. To capture comma-separated years as well, extend the regex to this:

[0-9]+(?:\s+,\s+[0-9]+)*

Here [0-9]+ matches a number, and (?:\s+,\s+[0-9]+)* matches one or more whitespace characters, a comma, one or more whitespace characters, and another number, with that whole group repeated zero or more times.

Change your pandas DataFrame line to this:

data3['CopyRight_year'] = data3['CopyRight'].str.extract(r'([0-9]+(?:\s+,\s+[0-9]+)*)', expand=False).str.replace(r'\s+', '', regex=True)

This prints:

                                           CopyRight CopyRight_year
0                      2015 Sony Music Entertainment           2015
1   2015 Ultra Records , LLC under exclusive license           2015
2  2014 , 2015 Epic Records , a 1999 division of ...      2014,2015
3  Compilation ( P ) 2014 Epic Records , a divisi...           2014
4  2014 , 2015 Epic Records , a division of Sony ...      2014,2015
5  2014 , 2015 Epic Records , a division of Sony ...      2014,2015
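A self-contained sketch of this extract-based approach, using hand-made sample rows (the third row is hypothetical and was added only to show a limitation). `regex=True` is passed explicitly because recent pandas releases treat the `str.replace` pattern as a literal string by default.

```python
import pandas as pd

data3 = pd.DataFrame({
    "CopyRight": [
        "2015 Sony Music Entertainment",
        "2014 , 2015 Epic Records , a division of Sony Music Entertainment",
        # Hypothetical row: the two years are not adjacent.
        "2014 Epic Records , reissued 2015",
    ]
})

# One number, optionally followed by more numbers separated by " , ".
pattern = r"([0-9]+(?:\s+,\s+[0-9]+)*)"

data3["CopyRight_year"] = (
    data3["CopyRight"]
    .str.extract(pattern, expand=False)      # first matching run only
    .str.replace(r"\s+", "", regex=True)     # drop the spaces around commas
)

print(data3["CopyRight_year"].tolist())
# ['2015', '2014,2015', '2014']
```

Note that the last row yields only 2014: extract returns the first match, so a year that is separated from the others by text is skipped.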

That said, I prefer jezrael's answer: findall followed by join is a cleaner and more flexible approach.
