Extracting dates using regex in Python

Question

I want to extract the year from my DataFrame column data3['CopyRight'].

CopyRight
2015 Sony Music Entertainment
2015 Ultra Records , LLC under exclusive license
2014 , 2015 Epic Records , a division of Sony Music Entertainment
Compilation ( P ) 2014 Epic Records , a division of Sony Music Entertainment
2014 , 2015 Epic Records , a division of Sony Music Entertainment
2014 , 2015 Epic Records , a division of Sony Music Entertainment

I am using the code below to extract the year:

data3['CopyRight_year'] = data3['CopyRight'].str.extract('([0-9]+)', expand=False).str.strip()

With my code, I only get the first occurrence of a year in each row.

CopyRight_year
2015
2015
2014
2014
2014
2014

I want to extract all the years mentioned in the column.

Expected Output

CopyRight_year
    2015
    2015
    2014,2015
    2014
    2014,2015
    2014,2015

jezrael · Accepted Answer · 2019-02-24 09:29:11Z

Use str.findall with a regex to collect every 4-digit integer in each row into a list, then join each list with a separator.

Thanks to @Wiktor Stribiżew for the idea of adding word boundaries, r'\b\d{4}\b':

data3['CopyRight_year'] = data3['CopyRight'].str.findall(r'\b\d{4}\b').str.join(',')
print(data3)
                                           CopyRight CopyRight_year
0                      2015 Sony Music Entertainment           2015
1   2015 Ultra Records , LLC under exclusive license           2015
2  2014 , 2015 Epic Records , a division of Sony ...      2014,2015
3  Compilation ( P ) 2014 Epic Records , a divisi...           2014
4  2014 , 2015 Epic Records , a division of Sony ...      2014,2015
5  2014 , 2015 Epic Records , a division of Sony ...      2014,2015
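
To try the findall-and-join approach in isolation, here is a minimal sketch. It assumes a small frame rebuilt by hand from four of the rows in the question; the intermediate name `years` is only illustrative and not part of the original answer.

```python
import pandas as pd

# Sample rows copied from the question; the frame is rebuilt here for illustration.
data3 = pd.DataFrame({
    "CopyRight": [
        "2015 Sony Music Entertainment",
        "2015 Ultra Records , LLC under exclusive license",
        "2014 , 2015 Epic Records , a division of Sony Music Entertainment",
        "Compilation ( P ) 2014 Epic Records , a division of Sony Music Entertainment",
    ]
})

# str.findall returns, per row, a list of every standalone 4-digit number.
years = data3["CopyRight"].str.findall(r"\b\d{4}\b")

# str.join collapses each list into one comma-separated string.
data3["CopyRight_year"] = years.str.join(",")

print(data3["CopyRight_year"].tolist())
# ['2015', '2015', '2014,2015', '2014']
```

Keeping the list from findall in its own variable also makes it easy to use a different separator, or to keep the years as lists if that suits later processing better.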

edited Feb 24, 2019 at 9:29

answered Feb 24, 2019 at 8:58


2 Comments

Wiktor Stribiżew Over a year ago

I would use r'\b\d{4}\b', since '(\d{4})' will match 4-digit chunks even inside longer digit chunks (e.g. 0067 in 006789).
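
The difference the word boundaries make can be checked with the standard re module. This is a small sketch on a made-up string (the catalog number is hypothetical, chosen to contain the 006789 case from the comment above):

```python
import re

# Hypothetical text: a 6-digit catalog number followed by two years.
text = "Catalog 006789 , released 2014 , 2015"

# Without boundaries, a 4-digit match can start inside a longer digit run.
loose = re.findall(r"\d{4}", text)
print(loose)   # ['0067', '2014', '2015']

# With \b on both sides, only standalone 4-digit numbers match.
strict = re.findall(r"\b\d{4}\b", text)
print(strict)  # ['2014', '2015']
```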

Aditya Sharma Over a year ago

@jezrael - Thanks a lot, I am getting the expected output.

Pushpesh Kumar Rajwanshi · Accepted Answer · 2019-02-24 10:06:05Z

Your current regex captures only the first run of digits. To capture the comma-separated years as well, extend the regex to this:

[0-9]+(?:\s+,\s+[0-9]+)*

Here [0-9]+ matches a number, and (?:\s+,\s+[0-9]+)* then matches one or more whitespace characters, a comma, one or more whitespace characters, and another number, with the whole group repeated zero or more times as the data requires.
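
How the pattern behaves on a single string can be sketched with the standard re module. The first string comes from the question; the second is a hypothetical row added only to show that a year separated from the first run by other text is not picked up, because only the first match is taken.

```python
import re

pattern = re.compile(r"[0-9]+(?:\s+,\s+[0-9]+)*")

# Row taken from the question: two years separated by " , ".
row = "2014 , 2015 Epic Records , a division of Sony Music Entertainment"
first = pattern.search(row).group(0)
print(first)                     # '2014 , 2015'

# Removing the whitespace gives the compact comma-separated form.
compact = re.sub(r"\s+", "", first)
print(compact)                   # '2014,2015'

# Hypothetical row: the second year is not adjacent to the first,
# so the first match stops after '2014'.
other = "2014 Epic Records , reissued 2016"
only_first = pattern.search(other).group(0)
print(only_first)                # '2014'
```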

Change your pandas DataFrame line to this:

data3['CopyRight_year'] = data3['CopyRight'].str.extract(r'([0-9]+(?:\s+,\s+[0-9]+)*)', expand=False).str.replace(r'\s+', '', regex=True)

This prints:

                                           CopyRight CopyRight_year
0                      2015 Sony Music Entertainment           2015
1   2015 Ultra Records , LLC under exclusive license           2015
2  2014 , 2015 Epic Records , a 1999 division of ...      2014,2015
3  Compilation ( P ) 2014 Epic Records , a divisi...           2014
4  2014 , 2015 Epic Records , a division of Sony ...      2014,2015
5  2014 , 2015 Epic Records , a division of Sony ...      2014,2015

That said, I like jezrael's answer, which uses findall and join: it is more flexible and cleaner.
