8

In my dataset, I have a date column in which the data is of mixed format.

  • Some rows only have YYYY (e.g. 2009)
  • Others have MMM-YY (e.g. Jan-08)
  • Yet others have a full date MM/DD/YYYY (e.g. 01/15/2006)

How might I remove the entries that do not follow the format MM/DD/YYYY? I'm not quite sure where to start here.

Below, I have provided the .head() of the data as a dict.

{'Collection Date': {0: '2001',
  1: '2002',
  2: '2006',
  3: '2/19/2006',
  4: '2/28/2006'},
 'Complete Genome': {0: 'No', 1: 'No', 2: 'No', 3: 'No', 4: 'No'},
 'Country': {0: 'Egypt', 1: 'Egypt', 2: 'Egypt', 3: 'Egypt', 4: 'Egypt'},
 'Flu Season': {0: '-N/A-', 1: '-N/A-', 2: '-N/A-', 3: '-N/A-', 4: '-N/A-'},
 'Host Species': {0: 'IRD:Human',
  1: 'IRD:Human',
  2: 'IRD:Bird/Avian',
  3: 'IRD:Chicken/Avian',
  4: 'IRD:Avian'},
 'Protein Name': {0: 'NA', 1: 'NA', 2: 'NA', 3: 'NA', 4: 'HA'},
 'Segment': {0: 6, 1: 6, 2: 6, 3: 6, 4: 4},
 'Segment Length': {0: 1428, 1: 1449, 2: 1441, 3: 1363, 4: 1707},
 'Sequence Accession': {0: 'AJ457944',
  1: 'AJ457943',
  2: 'GU050304',
  3: 'GQ184251',
  4: 'KF178948'},
 'State/Province': {0: '-N/A-',
  1: '-N/A-',
  2: '-N/A-',
  3: '-N/A-',
  4: '-N/A-'},
 'Strain Name': {0: '(A/Egypt/84/2001(H1N2))',
  1: '(A/Egypt/96/2002(H1N2))',
  2: 'A/avian/Egypt/920431/2006(H9N2)',
  3: 'A/chicken/Egypt/06207-NLQP/2006(H5N1)',
  4: 'A/chicken/Egypt/0626/2006'},
 'Subtype': {0: 'H1N2', 1: 'H1N2', 2: 'H9N2', 3: 'H5N1', 4: 'H5N1'}}
8
  • Probably better to split your data using regexp and then apply some date parsing using strptime IMO, I can't think of an easier method Commented Feb 4, 2014 at 15:41
  • @EdChum is right, this is very tricky to do this (and pandas for efficiency won't). Commented Feb 4, 2014 at 15:46
  • An alternate thing that I could try is if I were to simply remove data that does not fit the format MM/DD/YYYY. Would that be possible? I will also modify the question. Commented Feb 4, 2014 at 15:47
  • @ericmjl yes you could do that, again using regexp Commented Feb 4, 2014 at 15:49
  • 1
    @ericmjl I just read your new requirement and updated my answer to filter dates that are not MM/DD/YYYY, in your case using string length is easier and simpler than a messy regexp string Commented Feb 4, 2014 at 16:17

2 Answers 2

8

Rather than applying Regular expression to match just MM/DD/YYYY, if your dates are only ever YYYY or MMM/YYYY or MM/DD/YYYY then you can exploit the fact that MM/DD/YYYY is a string of length 10:

In [8]:

import pandas as pd

pd.set_option('display.notebook_repr_html', False)
df = pd.DataFrame({'date':['01/03/1987', '2003', 'Jan-08', '31/01/2010']})
df
Out[8]:
         date
0  01/03/1987
1        2003
2      Jan-08
3  31/01/2010

[4 rows x 1 columns]
In [9]:

df.ix[df.date.str.len() !=10]
Out[9]:
     date
1    2003
2  Jan-08

[2 rows x 1 columns]

You can then just use to_datetime:

In [16]:

df1 = df.ix[df.date.str.len() !=10]
df1
Out[16]:
     date
1    2003
2  Jan-08

[2 rows x 1 columns]
In [17]:

df1.date = pd.to_datetime(df1.date)
df1
Out[17]:
                 date
1 2003-01-01 00:00:00
2 2014-01-08 00:00:00

[2 rows x 1 columns]

Just for completeness if you wanted to filter using regexp:

df.ix[~df.date.str.contains('(\d{2})[/](\d{2})[/](\d{4})')]

would work, note the ~ which is a negation

Sign up to request clarification or add additional context in comments.

Comments

8

You can use pd.to_datetime with the option errors='coerce' to convert invalid dates to NaT and then filter out NaTs with dropna()

Example:

>>> df = pd.DataFrame({'date':['01/03/1987', '2003', 'Jan-08', '31/01/2010', '2/13/2016'],'value':range(5)})
>>> df
         date  value
0  01/03/1987      0
1        2003      1
2      Jan-08      2
3  31/01/2010      3
4   2/13/2016      4

Format: DD/MM/YYYY

>>> pd.to_datetime(df['date'], format='%d/%m/%Y', errors='coerce')
0   1987-03-01
1          NaT
2          NaT
3   2010-01-31
4          NaT
>>> df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y', errors='coerce')
>>> df.dropna()
        date  value
0 1987-03-01      0
3 2010-01-31      3

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.