I am new to text mining and I need to extract the dates from a *.txt file and sort them. The dates are embedded in sentences (one per line) and can take any of the following formats:

04/20/2009; 04/20/09; 4/20/09; 4/3/09
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009
2009; 2010

If the day is missing, assume the 1st; if the month is missing, assume January.

My idea is to extract all dates and convert them into mm/dd/yyyy format. However, I am unsure how to find and replace the patterns. This is what I have done:

import re
import pandas as pd

doc = []
with open('dates.txt') as file:
    for line in file:
        doc.append(line)

df = pd.Series(doc)

df2 = pd.DataFrame(df,columns=['text'])

def myfunc(x):
    # x is one matched numeric date, e.g. '2009', '6/2008' or '4/20/09'
    if len(x)==4 and x.isdigit():
        # year only: default to January 1st
        x = '01/01/'+x
    else:
        # treat '-' and '/' as the same separator
        terms = re.split('[/-]',x)
        if len(terms[-1])==2:
            # two-digit year: assume 19xx
            terms[-1] = '19'+terms[-1]
        if len(terms)==2:
            # month/year: default to the 1st of the month
            x = terms[0].zfill(2)+'/01/'+terms[-1]
        else:
            x = terms[0].zfill(2)+'/'+terms[1].zfill(2)+'/'+terms[-1]
    return x

# pandas 2.0 and later default to regex=False, which rejects a callable replacement
df2['text'] = df2.text.str.replace(r'(((?:\d+[/-])?\d+[/-]\d+)|\d{4})', lambda m: myfunc(m.group(0)), regex=True)

I have done it only for the numerical date formats, but I am confused about how to do it with the alphanumeric dates.

I know it is rough code, but this is what I have so far.

1 Answer

I think this is one of the Coursera text mining assignments. You can use regex with str.extract to get the solution. First read dates.txt into a Series, i.e.

import pandas as pd

doc = []
with open('dates.txt') as file:
    for line in file:
        doc.append(line)

df = pd.Series(doc)

def date_sorter():
    # Get the dates written with month names (expand=False returns a Series)
    one = df.str.extract(r'((?:\d{,2}\s)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*(?:-|\.|\s|,)\s?\d{,2}[a-z]*(?:-|,|\s)?\s?\d{2,4})', expand=False)
    # Get the fully numeric dates
    two = df.str.extract(r'((?:\d{1,2})(?:(?:\/|-)\d{1,2})(?:(?:\/|-)\d{2,4}))', expand=False)
    # Get the dates with no day, i.e. month and year, or year only
    three = df.str.extract(r'((?:\d{1,2}(?:-|\/))?\d{4})', expand=False)
    # Fill the NaNs in one from two, then from three, and convert to datetime.
    # The replace calls fix month names that are misspelled in the text file.
    # Note: pandas 2.x may need format='mixed' here, since the strings use several formats.
    dates = pd.to_datetime(one.fillna(two).fillna(three).replace('Decemeber','December',regex=True).replace('Janaury','January',regex=True))
    return pd.Series(dates.sort_values())

date_sorter()

Output:

9     1971-04-10
84    1971-05-18
2     1971-07-08
53    1971-07-11
28    1971-09-12
474   1972-01-01
153   1972-01-13
13    1972-01-26
129   1972-05-06
98    1972-05-13
111   1972-06-10
225   1972-06-15
31    1972-07-20
171   1972-10-04
191   1972-11-30
486   1973-01-01
335   1973-02-01
415   1973-02-01
36    1973-02-14
405   1973-03-01
323   1973-03-01
422   1973-04-01
375   1973-06-01
380   1973-07-01
345   1973-10-01
57    1973-12-01
481   1974-01-01
436   1974-02-01
104   1974-02-24
299   1974-03-01

If you want to return only the index then return pd.Series(dates.sort_values().index)
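The order of the fillna chain in date_sorter matters, because the third pattern also matches inside a full numeric date. A minimal sketch of the same "first pattern that matches wins" idea, using only the standard re module instead of pandas. The helper first_date and the sample lines are made up for illustration and are not taken from dates.txt.

```python
import re

# The three patterns from date_sorter, unchanged.
ONE = (r'((?:\d{,2}\s)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*'
       r'(?:-|\.|\s|,)\s?\d{,2}[a-z]*(?:-|,|\s)?\s?\d{2,4})')
TWO = r'((?:\d{1,2})(?:(?:\/|-)\d{1,2})(?:(?:\/|-)\d{2,4}))'
THREE = r'((?:\d{1,2}(?:-|\/))?\d{4})'

def first_date(line):
    """Hypothetical helper: try ONE, then TWO, then THREE, and return the first hit."""
    for pattern in (ONE, TWO, THREE):
        m = re.search(pattern, line)
        if m:
            return m.group(1)
    return None

# Made-up lines, one date each.
lines = [
    'seen 04/20/2009 for follow up',
    'since 6/2008 on medication',
    'surgery in 2010',
    'admitted 20 Mar 2009',
]
extracted = [first_date(line) for line in lines]
print(extracted)

# What THREE would return for the first line if it were tried on its own.
three_alone = re.search(THREE, lines[0]).group(1)
print(three_alone)
```

On its own, the third pattern picks 20/2009 out of 04/20/2009, which is why it is only used as the last fallback.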

Breakdown of the first regex

 # (?: ... ) is a non-capturing group

((?:\d{,2}\s)? # Optional day before the month: up to two digits followed by one whitespace character. The `?` applies to the whole preceding group, so the group occurs once or not at all.

 (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* # A month abbreviation followed by any number (`*`) of lowercase letters `[a-z]`, so both Mar and March match.

 (?:-|\.|\s|,) # Exactly one separator: -, ., whitespace or ,

 \s? # An optional whitespace character (`?` applies only to the preceding token, here `\s`)

 \d{,2}[a-z]* # Up to two digits (possibly none) followed by any number of lowercase letters (`*`), e.g. 20, 1st, 13th, 22nd.

 (?:-|,|\s)? # An optional separator: -, , or whitespace may occur once or not at all because of the `?` at the end

 \s? # An optional whitespace character (at most one; `?` again applies only to `\s`)

 \d{2,4}) # Two to four digits for the year
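To check the breakdown against the formats listed in the question, here is a small sketch using only the standard re module. The sample strings are copied from the question. WORD_DATE and MONTH_SEP are names introduced here, and the toy patterns ab? and (?:ab)? are only there to show how far a trailing `?` reaches.

```python
import re

# The first pattern from the answer, unchanged.
WORD_DATE = re.compile(
    r'((?:\d{,2}\s)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*'
    r'(?:-|\.|\s|,)\s?\d{,2}[a-z]*(?:-|,|\s)?\s?\d{2,4})'
)

# Sample formats copied from the question.
samples = [
    'Mar-20-2009', 'Mar 20, 2009', 'March 20, 2009', 'Mar. 20, 2009',
    '20 Mar 2009', '20 March, 2009', 'Mar 20th, 2009', 'Feb 2009',
]

# Each sample should be matched in full by the pattern.
matched = [WORD_DATE.search(s).group(1) for s in samples]
print(matched == samples)

# `?` makes only the token or group directly before it optional.
print(bool(re.fullmatch(r'ab?', 'a')))     # only 'b' is optional
print(bool(re.fullmatch(r'ab?', '')))      # 'a' is still required
print(bool(re.fullmatch(r'(?:ab)?', '')))  # the whole group is optional

# Month name plus separator, the fragment asked about in the comments.
MONTH_SEP = r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*(?:-|\.|\s|,)\s?'
print(bool(re.fullmatch(MONTH_SEP, 'Mar.')))   # trailing whitespace absent
print(bool(re.fullmatch(MONTH_SEP, 'Mar. ')))  # trailing whitespace present once
print(bool(re.fullmatch(MONTH_SEP, 'Mar')))    # separator missing
```

The last three checks show that the final `?` in the month fragment makes only the trailing `\s` optional; the separator before it must still be present.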

Hope it helps.

2 Comments

@bharath shetty I have a question regarding the ?. How many preceding elements does it affect? For instance, is the last '?' in the code below (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*(?:-|\.|\s|,)\s? referring only to [a-z]*(?:-|\.|\s|,)\s?
I'm unable to understand the question properly. Do you mean which group the ending ? in (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*(?:-|\.|\s|,)\s? refers to?
