Extract date from string in a pandas dataframe column

Question

I am trying to extract dates from a DataFrame column containing strings and store them in another column.

from dateutil.parser import parse
 
extract = parse("January 24, 1976", fuzzy_with_tokens=True)
print(str(extract[0]))

The above code extracts: 1976-01-24 00:00:00

I would like to do this for every string in a DataFrame column.

Below is what I am trying, but it is not working:

df['Dates'] = df.apply(lambda x: parse(x['Column to extract'], fuzzy_with_tokens=True), axis=1)

Things to note:

If a string contains multiple dates, they need to be joined with some delimiter.
Some strings may not contain a date. In that case the parser raises "ParserError: String does not contain a date". This needs to be handled.

(1) Can you provide some example data? I'm not sure I understand what you mean by "multiple dates... join with some delimiter". (2) How would you want to handle strings that aren't dates? Convert to NaT?
– Ian Thompson, Commented Nov 17, 2022 at 16:46

Ian Thompson · Accepted Answer · 2022-11-17 17:43:13Z

1

See pd.to_datetime

It operates in a vectorized manner, so it can convert a whole column of dates quickly.

df["Dates"] = pd.to_datetime(df["Dates"])

If there are strings that won't convert to a datetime and you want them nullified, you can use errors="coerce"

df["Dates"] = pd.to_datetime(df["Dates"], errors="coerce")
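As a minimal sketch of what errors="coerce" does, the snippet below runs pd.to_datetime on a few made-up values (the sample strings are assumptions, not the asker's data). It assumes a reasonably recent pandas. In this sketch only the value that is a date on its own converts; the plain text and the sentence with an embedded date both come back as NaT, which is why free-text extraction needs a different tool.

```python
import pandas as pd

# Hypothetical sample values, not taken from the question's data
raw = pd.Series([
    "January 24, 1976",            # a clean date string
    "no date",                     # no date at all
    "today is january 26, 2016.",  # a date embedded in a sentence
])

# errors="coerce" turns anything that cannot be parsed into NaT
converted = pd.to_datetime(raw, errors="coerce")
print(converted)
```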

NER with `spacy`

import spacy  # 3.4.2
from spacy import displacy


nlp = spacy.load("en_core_web_sm")

eg_txt = "today is january 26, 2016. Tomorrow is january 27, 2016"

doc = nlp(eg_txt)

displacy.render(doc, style="ent")

We can apply the spacy logic to a dataframe

import pandas as pd  # 1.5.1


# some fake data
df = pd.DataFrame({
    "text": ["today is january 26, 2016. Tomorrow is january 27, 2016",
             "today is january 26, 2016.",
             "Tomorrow is january 27, 2016"]
})

# convert text to spacy docs
docs = nlp.pipe(df.text.to_numpy())

# unpack the generator into a series
doc_series = pd.Series(docs, index=df.index, name="docs")

df = df.join(doc_series)

# extract entities
df["entities"] = df.docs.apply(lambda x: x.ents)

# explode to one entity per row
df = df.explode(column="entities")

# build dictionary of ent type and ent text
df["entities"] = df.entities.apply(lambda ent: {ent.label_: ent.text})

# join back with df
df = df.join(df["entities"].apply(pd.Series))

# convert all DATE entities to datetime
df["dates"] = pd.to_datetime(df.DATE, errors="coerce")

# back to one row per original text and a container of datetimes
df = df.groupby("text").dates.unique().to_frame().reset_index()

print(df)

                                                text                                              dates
0                       Tomorrow is january 27, 2016               [NaT, 2016-01-27T00:00:00.000000000]
1                         today is january 26, 2016.  [2022-11-17T11:42:49.607705000, 2016-01-26T00:...
2  today is january 26, 2016. Tomorrow is january...  [2022-11-17T11:42:49.605705000, 2016-01-26T00:...

Note that the output also contains NaT and timestamps from 2022-11-17. These presumably appear because spacy tags the words "today" and "Tomorrow" as DATE entities too: pd.to_datetime parses "today" as the current time (the time this code was run), while "Tomorrow" cannot be parsed and is coerced to NaT.

edited Nov 17, 2022 at 17:43

answered Nov 17, 2022 at 16:42

Ian Thompson


2 Comments

Krishna Over a year ago

I am trying to extract date present in any string (for ex: today is january 26, 2016. Tomorrow is january 27, 2016 should return 2016-01-26 00:00:00 | 2016-01-27 00:00:00). Here | is the delimiter

Ian Thompson Over a year ago

That sounds a bit like Named Entity Recognition (NER). You may want to look into spacy or nltk

Nuri Taş · Accepted Answer · 2022-11-17 16:56:03Z

0

If you want to use parse, you may need a customized function to handle exceptions:

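import numpy as np
from dateutil.parser import parse


def parse_date(text):
    # Return the date found in the string, or NaN if parsing fails
    try:
        date = parse(text, fuzzy_with_tokens=True)
        return date[0]
    # ParserError is a subclass of ValueError; TypeError covers non-string values
    except (ValueError, OverflowError, TypeError):
        return np.nan


df['dates'] = df['Column to extract'].apply(parse_date)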
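The function above returns at most one date per string. For the multiple-date requirement in the question, one possible sketch is to find every date-like substring first and join the results. The version below uses only the standard library and swaps dateutil for a regular expression; it assumes the dates are written as a full English month name followed by day and year ("January 24, 1976"). The names MONTHS, DATE_RE and extract_dates are hypothetical helpers made up for this illustration.

```python
import re
from datetime import datetime

# Assumption: dates are written like "January 24, 1976" (full month name)
MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]
DATE_RE = re.compile(
    r"\b(" + "|".join(MONTHS) + r")\s+(\d{1,2}),\s*(\d{4})\b",
    re.IGNORECASE,
)


def extract_dates(text, sep=" | "):
    """Return every date found in text joined by sep, or "" if there is none."""
    if not isinstance(text, str):
        return ""
    found = []
    for month, day, year in DATE_RE.findall(text):
        try:
            found.append(datetime(int(year), MONTHS.index(month.lower()) + 1, int(day)))
        except ValueError:
            # Skip impossible dates such as "February 30, 2020"
            continue
    return sep.join(str(d) for d in found)


print(extract_dates("today is january 26, 2016. Tomorrow is january 27, 2016"))
# -> 2016-01-26 00:00:00 | 2016-01-27 00:00:00
```

It can be applied to a column in the same way as parse_date, for example df['dates'] = df['Column to extract'].apply(extract_dates). Other date formats would need additional patterns, or a NER approach like the spacy one in the other answer.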

answered Nov 17, 2022 at 16:56

Nuri Taş


2 Comments

Krishna Over a year ago

Was hoping this would work. But, I am only getting 'NaN'

Nuri Taş Over a year ago

You're expected to share a reproducible sample of your dataframe at this point. It works fine for the following dataframe: df = pd.DataFrame({'Column to extract':['no date', "January 24, 1976", "January 25, 1976, "]})
