0

I am trying to extract date from a DF column containing strings and store in another column.

from dateutil.parser import parse
 
extract = parse("January 24, 1976", fuzzy_with_tokens=True)
print(str(extract[0]))

The above code extracts: 1976-01-24 00:00:00

I would like this to be done to all strings in a column in a DF.

The below is what I am trying but is not working:

df['Dates'] = df.apply(lambda x: parse(x['Column to extract'], fuzzy_with_tokens=True), axis=1)

Things to note:

  1. If there are multiple dates, need to join them with some delimiter
  2. There can be strings without date. In that case parser returns an error "ParserError: String does not contain a date". This needs to be handled.
1
  • (1) can you provide some example data? Not sure I understand what you mean by "multiple dates... join with some delimiter". (2) how would you want to handle strings that aren't dates? convert to NaT? Commented Nov 17, 2022 at 16:46

2 Answers 2

1

See pd.to_datetime

It operates in a vectorized manner so can convert all dates quickly.

df["Dates"] = pd.to_datetime(df["Dates"])

If there are strings that won't convert to a datetime and you want them nullified, you can use errors="coerce"

df["Dates"] = pd.to_datetime(df["Dates"], errors="coerce")

NER with spacy

import spacy  # 3.4.2
from spacy import displacy


nlp = spacy.load("en_core_web_sm")

eg_txt = "today is january 26, 2016. Tomorrow is january 27, 2016"

doc = nlp(eg_txt)

displacy.render(doc, style="ent")

ents

We can apply the spacy logic to a dataframe

import pandas as pd  # 1.5.1


# some fake data
df = pd.DataFrame({
    "text": ["today is january 26, 2016. Tomorrow is january 27, 2016",
             "today is january 26, 2016.",
              "Tomorrow is january 27, 2016"]
})

# convert text to spacy docs
docs = nlp.pipe(df.text.to_numpy())

# unpack the generator into a series
doc_series = pd.Series(docs, index=df.index, name="docs")

df = df.join(doc_series)

# extract entities
df["entities"] = df.docs.apply(lambda x: x.ents)

# explode to one entity per row
df = df.explode(column="entities")

# build dictionary of ent type and ent text
df["entities"] = df.entities.apply(lambda ent: {ent.label_: ent.text})

# join back with df
df = df.join(df["entities"].apply(pd.Series))

# convert all DATE entities to datetime
df["dates"] = pd.to_datetime(df.DATE, errors="coerce")

# back to one row per original text and a container of datetimes
df = df.groupby("text").dates.unique().to_frame().reset_index()

print(df)
                                                text                                              dates
0                       Tomorrow is january 27, 2016               [NaT, 2016-01-27T00:00:00.000000000]
1                         today is january 26, 2016.  [2022-11-17T11:42:49.607705000, 2016-01-26T00:...
2  today is january 26, 2016. Tomorrow is january...  [2022-11-17T11:42:49.605705000, 2016-01-26T00:...
Sign up to request clarification or add additional context in comments.

2 Comments

I am trying to extract date present in any string (for ex: today is january 26, 2016. Tomorrow is january 27, 2016 should return 2016-01-26 00:00:00 | 2016-01-27 00:00:00). Here | is the delimiter
That sounds a bit like Named Entity Recognition (NER). You may want to look into spacy or nltk
0

If you want to use parse, you may need a customized function to handle exceptions:

def parse_date(row):
    try:
        date = parse(row, fuzzy_with_tokens=True)
        return date[0]
    except:
        return np.nan


df['dates'] = df['Column to extract'].apply(lambda x: parse_date(x))

2 Comments

Was hoping this would work. But, I am only getting 'NaN'
You're expected to share a reproducible sample of your dataframe at this point. It works fine for the following dataframe: df = pd.DataFrame({'Column to extract':['no date', "January 24, 1976", "January 25, 1976, "]})

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.