Extracting date using regex on DataFrame?

Question

I want to extract the dates from the description column into another column, but I have run into some issues.

This is my DataFrame code:

df = pd.DataFrame({'description':['description: kartu debit 20/10 indomaretcipete r', 'description: tarikan atm 20/10', 
                                 'description: biaya adm', 'description: trsf e-banking db 18/10 wsid:23881 riri indah lestari', 
                                 'description: switching biaya txn di 008 komp clandak armori', 'description: switching withdrawal di 008 komp clandak imori', 
                                 'description: trsf e-banking db tanggal :13/10 13/10 wsid:269b1 dwi ayu mustika', 
                                 'description: trsf e-banking db 1310/ftva/ws269b100240/home credit - - 3800372540', 
                                 'description: kartu debit 09/10 starbuckspasaraya', 'description: byr via e-banking 13/09 wsid46841381200 telkomsel 081293112183 tezar alamsyah', 
                                 'description: switching db biaya txn ke 022 danabijak tezar albank centra', 'description: kartu debit spbu totalterogon'], 
                   'label': ['minimarket', 'atm penarikan', 'administrasi', 'transfer', 'biaya', 'penarikan', 'personal', 
                             'fintech', 'other', 'pulsa', 'biaya fintech', 'fuel']})

And this is what I have tried:

for date in df.description:
    date = df.description
    date = re.findall(r'\d{2}/\d{2}', date)

    print(date)

But the output is TypeError: expected string or bytes-like object

Try df['description'].str.extractall(r'(\d{2}/\d{2})')?
– Chris Adams, Commented Aug 7, 2019 at 9:41
In your for loop, the first line of each iteration assigns the variable date to the whole pandas Series df.description, so re.findall receives a Series instead of a string. Remove that line. I'd also suggest giving your loop variable a different name: instead of date, use for description in df.description: ... and then date = re.findall(r'\d{2}/\d{2}', description)
– Chris Adams, Commented Aug 7, 2019 at 9:50
All the answers below are correct. I love how @political scientist gave me a simple one-line solution that still answers my question.
– ebuzz168, Commented Aug 8, 2019 at 2:23
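
The loop fix described in the comments can be sketched as follows. This is a minimal illustration on a two-row sample of the question's data; the all_dates list is a hypothetical helper added here only to collect the results.

```python
import re
import pandas as pd

# Two-row sample of the question's data (shortened for illustration)
df = pd.DataFrame({'description': [
    'description: tarikan atm 20/10',
    'description: biaya adm',
]})

all_dates = []  # hypothetical helper list, not part of the original code
for description in df['description']:
    # description is a plain string here, so re.findall accepts it
    dates = re.findall(r'\d{2}/\d{2}', description)
    all_dates.append(dates)
    print(dates)
```

Each iteration now passes a single string to re.findall, which avoids the TypeError raised when the whole Series is passed.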

Erfan · Accepted Answer · 2019-08-07 09:53:13Z

1

To completely answer your question:

1. Use str.extractall
2. Unstack rows to columns
3. Merge the matches back to the original dataframe

matches = df['description'].str.extractall(r'(\d{2}/\d{2})').unstack()
matches.columns = ['match1', 'match2']
final = df.merge(matches, left_index=True, right_index=True, how='left')

Output

                                          description          label match1 match2
0    description: kartu debit 20/10 indomaretcipete r     minimarket  20/10    NaN
1                      description: tarikan atm 20/10  atm penarikan  20/10    NaN
2                              description: biaya adm   administrasi    NaN    NaN
3   description: trsf e-banking db 18/10 wsid:2388...       transfer  18/10    NaN
4   description: switching biaya txn di 008 komp c...          biaya    NaN    NaN
5   description: switching withdrawal di 008 komp ...      penarikan    NaN    NaN
6   description: trsf e-banking db tanggal :13/10 ...       personal  13/10  13/10
7   description: trsf e-banking db 1310/ftva/ws269...        fintech    NaN    NaN
8    description: kartu debit 09/10 starbuckspasaraya          other  09/10    NaN
9   description: byr via e-banking 13/09 wsid46841...          pulsa  13/09    NaN
10  description: switching db biaya txn ke 022 dan...  biaya fintech    NaN    NaN
11         description: kartu debit spbu totalterogon           fuel    NaN    NaN
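
The hard-coded ['match1', 'match2'] works here because no description contains more than two matches. A possible variation, assuming the number of matches per row is not known in advance, builds the column names from the unstacked match numbers instead. The shortened sample data and the use of df.join are choices made for this sketch, not part of the answer above.

```python
import pandas as pd

# Shortened sample data (assumption: same structure as the question's frame)
df = pd.DataFrame({'description': [
    'kartu debit 20/10 indomaret',
    'biaya adm',
    'tanggal :13/10 13/10 wsid',
]})

# Select capture group 0 as a Series, then pivot the match number into columns
matches = df['description'].str.extractall(r'(\d{2}/\d{2})')[0].unstack()

# Name the columns match1, match2, ... for however many matches exist
matches.columns = [f'match{i + 1}' for i in matches.columns]

# join aligns on the index and keeps rows without any match (left join)
final = df.join(matches)
print(final)
```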

answered Aug 7, 2019 at 9:53

Erfan


2 Comments

ebuzz168 Over a year ago

Thank you, it is awesome to split it into 2 columns. I learned a lot!

Erfan Over a year ago

No worries, don't forget to accept one of the answers @ebuzz168

Hryhorii Pavlenko · Accepted Answer · 2019-08-07 10:02:07Z

1

I used str.findall to have all possible matches in one column, joined by comma (by default it would be a list containing all matches).

df['date'] = df['description'].str.findall(r'(\d{2}/\d{2})').apply(', '.join)

# output 
df['date'].values

array(['20/10', '20/10', '', '18/10', '', '', '13/10, 13/10', '', '09/10',
       '13/09', '', ''], dtype=object)

Edit:

Use str.join, as @Erfan suggested:

df['date'] = df['description'].str.findall(r'(\d{2}/\d{2})').str.join(', ')
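
To make the difference between the two steps concrete, here is a small sketch on a three-row sample (shortened from the question's data) that shows the intermediate lists returned by str.findall and the strings produced by str.join.

```python
import pandas as pd

# Three-row sample, shortened from the question's data
s = pd.Series([
    'tarikan atm 20/10',
    'biaya adm',
    'tanggal :13/10 13/10 wsid',
])

# str.findall returns one list per row (an empty list when nothing matches)
found = s.str.findall(r'\d{2}/\d{2}')
print(found.tolist())   # [['20/10'], [], ['13/10', '13/10']]

# str.join collapses each list into a single string
joined = found.str.join(', ')
print(joined.tolist())  # ['20/10', '', '13/10, 13/10']
```

Note that rows without a match end up as empty strings rather than NaN.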

edited Aug 7, 2019 at 10:02

answered Aug 7, 2019 at 9:55

Hryhorii Pavlenko


3 Comments

Erfan Over a year ago

Better to use str.join instead of apply: df['description'].str.findall(r'(\d{2}/\d{2})').str.join(', ')

Hryhorii Pavlenko Over a year ago

I don't think I've seen .str.join before. Thanks, @Erfan! Edited my answer

Hryhorii Pavlenko Over a year ago

@ebuzz168 glad I could help

Jaroslav Bezděk · Accepted Answer · 2019-08-07 10:13:26Z

1

I think you are almost there. Just delete the unnecessary line date = df.description and use the apply function to write the dates to a data frame column. Your code could look like the following (assuming df is the data frame you defined):

# imports
import numpy as np
import re

# return the first dd/dd match in the row's description, or NaN if there is none
def get_date(row):
    description = row['description']
    date_list = re.findall(r'\d{2}/\d{2}', description)
    if date_list:
        return date_list[0]
    return np.nan  # np.NaN was removed in NumPy 2.0

# make date column
df['date'] = df.apply(get_date, axis=1)
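
If only the first match per row is needed, a vectorized alternative to the row-wise apply is str.extract, which returns the first match of the capture group or NaN when there is none. This is a sketch on shortened sample data, not the method used in this answer.

```python
import pandas as pd

# Shortened sample data (assumption: same structure as the question's frame)
df = pd.DataFrame({'description': [
    'kartu debit 20/10 indomaret',
    'biaya adm',
    'tanggal :13/10 13/10 wsid',
]})

# expand=False returns a Series instead of a one-column DataFrame
df['date'] = df['description'].str.extract(r'(\d{2}/\d{2})', expand=False)
print(df)
```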

edited Aug 7, 2019 at 10:13

answered Aug 7, 2019 at 9:54

Jaroslav Bezděk

