I want to extract the date from the description column into another column, but I have encountered some issues.

This is my DataFrame code:

import re
import pandas as pd

df = pd.DataFrame({'description':['description: kartu debit 20/10 indomaretcipete r', 'description: tarikan atm 20/10',
                                 'description: biaya adm', 'description: trsf e-banking db 18/10 wsid:23881 riri indah lestari', 
                                 'description: switching biaya txn di 008 komp clandak armori', 'description: switching withdrawal di 008 komp clandak imori', 
                                 'description: trsf e-banking db tanggal :13/10 13/10 wsid:269b1 dwi ayu mustika', 
                                 'description: trsf e-banking db 1310/ftva/ws269b100240/home credit - - 3800372540', 
                                 'description: kartu debit 09/10 starbuckspasaraya', 'description: byr via e-banking 13/09 wsid46841381200 telkomsel 081293112183 tezar alamsyah', 
                                 'description: switching db biaya txn ke 022 danabijak tezar albank centra', 'description: kartu debit spbu totalterogon'], 
                   'label': ['minimarket', 'atm penarikan', 'administrasi', 'transfer', 'biaya', 'penarikan', 'personal', 
                             'fintech', 'other', 'pulsa', 'biaya fintech', 'fuel']})

And this is what I have tried:

for date in df.description:
    date = df.description
    date = re.findall(r'\d{2}/\d{2}', date)

    print(date)

But the output is: TypeError: expected string or bytes-like object

  • Try df['description'].str.extractall(r'(\d{2}/\d{2})')? Commented Aug 7, 2019 at 9:41
  • In your for loop, the first line of each iteration reassigns the variable date to the whole pandas Series df.description. Remove that line. I'd also suggest giving the loop variable a different name: instead of date, use for description in df.description: ... and then date = re.findall(r'\d{2}/\d{2}', description). Commented Aug 7, 2019 at 9:50
  • @ChrisA This method is nice, thanks for the information. Commented Aug 7, 2019 at 21:25
  • All the answers below are correct. I love how @political scientist gave me a simple one-line solution that still answers my question, Commented Aug 8, 2019 at 2:23
  • and how @Jaroslav Bezdek gave me an awesome function. Commented Aug 8, 2019 at 2:23
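
The loop fix described in the comments above can be sketched roughly as follows. The three sample strings are copied from the question's DataFrame; the `dates` list is a name introduced here only for illustration.

```python
import re

import pandas as pd

# Small sample taken from the question's DataFrame
df = pd.DataFrame({'description': [
    'description: kartu debit 20/10 indomaretcipete r',
    'description: biaya adm',
    'description: trsf e-banking db tanggal :13/10 13/10 wsid:269b1 dwi ayu mustika',
]})

dates = []  # hypothetical container for the per-row results
for description in df.description:
    # description is a single string here, so re.findall accepts it
    dates.append(re.findall(r'\d{2}/\d{2}', description))

print(dates)
```

The original TypeError came from passing the whole Series to re.findall; iterating gives one string per step, which is what the re module expects.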

3 Answers

1

To completely answer your question:

  1. Use str.extractall
  2. Unstack rows to columns
  3. Merge the matches back to original dataframe
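
# extractall gives one row per match; unstack pivots the match number into columns
matches = df['description'].str.extractall(r'(\d{2}/\d{2})').unstack()
# at most two matches per row in this data, hence two column names
matches.columns = ['match1', 'match2']
# left merge on the index keeps rows that have no match (they get NaN)
final = df.merge(matches, left_index=True, right_index=True, how='left')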

Output

                                          description          label match1 match2
0    description: kartu debit 20/10 indomaretcipete r     minimarket  20/10    NaN
1                      description: tarikan atm 20/10  atm penarikan  20/10    NaN
2                              description: biaya adm   administrasi    NaN    NaN
3   description: trsf e-banking db 18/10 wsid:2388...       transfer  18/10    NaN
4   description: switching biaya txn di 008 komp c...          biaya    NaN    NaN
5   description: switching withdrawal di 008 komp ...      penarikan    NaN    NaN
6   description: trsf e-banking db tanggal :13/10 ...       personal  13/10  13/10
7   description: trsf e-banking db 1310/ftva/ws269...        fintech    NaN    NaN
8    description: kartu debit 09/10 starbuckspasaraya          other  09/10    NaN
9   description: byr via e-banking 13/09 wsid46841...          pulsa  13/09    NaN
10  description: switching db biaya txn ke 022 dan...  biaya fintech    NaN    NaN
11         description: kartu debit spbu totalterogon           fuel    NaN    NaN
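
The hard-coded ['match1', 'match2'] above works because no row in this data has more than two matches. As a hedged variation, assuming the number of matches is not known in advance, the column names could be built from whatever extractall finds. This sketch uses a three-row sample from the question and DataFrame.join in place of merge.

```python
import pandas as pd

# Small sample taken from the question's DataFrame
df = pd.DataFrame({'description': [
    'description: tarikan atm 20/10',
    'description: biaya adm',
    'description: trsf e-banking db tanggal :13/10 13/10 wsid:269b1 dwi ayu mustika',
]})

# Column 0 holds the single capture group; unstack turns the match number into columns
matches = df['description'].str.extractall(r'(\d{2}/\d{2})')[0].unstack()
# Name the columns from the match numbers actually present (match1, match2, ...)
matches.columns = [f'match{i + 1}' for i in matches.columns]
# join aligns on the index and keeps rows without any match
final = df.join(matches)

print(final)
```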

2 Comments

Thank you, it is awesome to split it into 2 columns, I learned a lot!
No worries, don't forget to accept one of the answers, @ebuzz168
1

I used str.findall to collect all matches in one column, joined by commas (by default, each cell would be a list containing all matches).

df['date'] = df['description'].str.findall(r'(\d{2}/\d{2})').apply(', '.join)
# output 
df['date'].values

array(['20/10', '20/10', '', '18/10', '', '', '13/10, 13/10', '', '09/10',
       '13/09', '', ''], dtype=object)

Edit:

Use str.join, as @Erfan suggested:

df['date'] = df['description'].str.findall(r'(\d{2}/\d{2})').str.join(', ')
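
One caveat with the pattern \d{2}/\d{2} is that it can also match inside longer digit runs. A possible tightening, offered only as a sketch, is to add lookarounds so the two-digit groups cannot be part of a longer number. The string 'ref 123/456 only' is a made-up example, not taken from the question's data.

```python
import pandas as pd

s = pd.Series([
    'kartu debit 20/10 indomaretcipete r',
    'ref 123/456 only',        # hypothetical input, not from the question
    'tanggal :13/10 13/10',
])

# Plain pattern: also picks up '23/45' from inside '123/456'
plain = s.str.findall(r'\d{2}/\d{2}').str.join(', ')

# Lookarounds reject matches that have a digit directly before or after them
strict = s.str.findall(r'(?<!\d)\d{2}/\d{2}(?!\d)').str.join(', ')

print(plain.tolist())
print(strict.tolist())
```

Whether the stricter pattern is wanted depends on the data; it still does not check that the numbers form a valid day and month.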

3 Comments

Better to use str.join instead of apply: df['description'].str.findall(r'(\d{2}/\d{2})').str.join(', ')
I don't think I've seen .str.join before. Thanks, @Erfan! Edited my answer
@ebuzz168 glad I could help
1

I think you are almost there. Just delete the line date = df.description, which is unnecessary, and use the apply function to put the dates into a data frame column. Your code could look like the following (assuming df is your data frame):

# imports
import numpy as np
import re

# define function to be used in apply
def get_date(row):
    # return the first dd/mm match in the row's description, or NaN if there is none
    description = row['description']
    date_list = re.findall(r'\d{2}/\d{2}', description)
    if date_list:
        return date_list[0]
    return np.nan  # np.NaN was removed in NumPy 2.0

# make date column
df['date'] = df.apply(get_date, axis=1)
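
The function above keeps only the first match per row. A vectorized sketch with the same intent, assuming that first-match behavior is what is wanted, is str.extract with expand=False. The sample rows are copied from the question.

```python
import pandas as pd

# Small sample taken from the question's DataFrame
df = pd.DataFrame({'description': [
    'description: kartu debit 20/10 indomaretcipete r',
    'description: biaya adm',
    'description: trsf e-banking db tanggal :13/10 13/10 wsid:269b1 dwi ayu mustika',
]})

# str.extract returns the first match only; expand=False yields a Series,
# and rows without a match become NaN
df['date'] = df['description'].str.extract(r'(\d{2}/\d{2})', expand=False)

print(df['date'])
```

This avoids the row-wise apply, which tends to be slower on large frames, at the cost of dropping any second match in a row.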
