Extracting multiple dates from text in Python

Question

I have a DataFrame that looks like this:

Unit ID   Shipping to:
90        With x
91        With y
92        With z
116       Shipped to x 01/04/16. / Shipped to y - 09/08/18.
233       Shipped to z 03/01/17
265       Shipped to x 03/01/17 returned shipped to x 02/05/17
280       Shipped to x 06/01/17  Shipped to y 03/05/17 Shipped to z 12/12/17

I would like to extract every occurrence of x, y or z, together with the date that follows it if there is one. I can't say in advance how many occurrences of x, y or z there will be, but I would like an end result that looks something like this:

 Unit ID  Occurrence 1  Occurrence 2  Occurrence 3 Shipping to:
    90    x                                        With x
    91    y                                        With y
    92    z                                        With z
    116   x 01/04/16    y 09/08/18                 Shipped to x 01/04/16. / Shipped to y - 09/08/18.
    233   z 03/01/17                               Shipped to z 03/01/17
    265   x 03/01/17                               Shipped to x 03/01/17 returned shipped to x 02/05/17
    280   x 06/01/17    y 03/05/17    z 12/12/17   Shipped to x 06/01/17  Shipped to y 03/05/17 Shipped to z 12/12/17

So far I've only managed to extract the first date that appears in each row, using this:

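import re

date_col = []
for row in df['Shipping to:']:
    # re.search returns only the first match in the string (or None)
    match = re.search(r'\d{2}/\d{2}/\d{2}', str(row))
    # store the matched text rather than the match object
    date_col.append(match.group(0) if match else None)
df['dates'] = date_col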

Giving a few examples falls well short of a firm specification. Which format variants are allowed, what is the maximum number of "occurrences", what is the date order (m/d/y?), and so on. Most of the work here is not Python-related; it lies in refining the specification.
– guidot, Commented Oct 22, 2018 at 12:57
@guidot It seems quite clear that the regex to match the dates is \d{2}/\d{2}/\d{2}; I would say that is quite specific.
– rje, Commented Oct 22, 2018 at 14:18
@rje: And which of the \d{2} groups do you guess is the month? Wouldn't it be nice to reject, or at least not recognize, illegal days and months?
– guidot, Commented Oct 22, 2018 at 19:13
@guidot I would say that is not a task for regex. Use a regex to find candidates, then a parser to filter out illegal days and months.
– rje, Commented Oct 22, 2018 at 19:16
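
The "regex for candidates, parser for validation" idea from the last comment could be sketched roughly as follows. The helper name valid_dates and the default day/month/year format are assumptions made for illustration; the question never states the date order, so pass fmt="%m/%d/%y" if the data is month-first.

```python
import re
from datetime import datetime

# Loose pattern: finds anything that looks like a date.
DATE_RE = re.compile(r"\d{1,2}/\d{1,2}/\d{2}")

def valid_dates(text, fmt="%d/%m/%y"):
    """Return the date-like substrings of text that parse under fmt.

    fmt is an assumption: the source does not say whether dates are
    day-first or month-first.
    """
    found = []
    for candidate in DATE_RE.findall(str(text)):
        try:
            datetime.strptime(candidate, fmt)
        except ValueError:
            continue  # impossible day or month, e.g. month 33
        found.append(candidate)
    return found

print(valid_dates("Shipped to x 01/04/16. / Shipped to y - 09/08/18."))
print(valid_dates("2/33/44 aaa"))
```

The regex stays deliberately permissive, and strptime does the calendar check, so impossible dates are dropped without complicating the pattern.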

rje · Accepted Answer · 2018-10-22 14:51:03Z

pandas itself has a very nice function for this, Series.str.extractall:

df['Shipping to:'].str.extractall(r'(\d{1,2}/\d{1,2}/\d{2})').unstack()

Note that I changed your regex to include a group (with ()) and that I'm matching single digits as well for the month and day.
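
To see what the two steps do separately, here is a small sketch on two rows taken from the question (the variable names long_form and wide_form are just for illustration). extractall returns one row per match, indexed by the original row label plus a match number, and unstack then pivots the match number into columns.

```python
import pandas as pd

df = pd.DataFrame(
    {"Shipping to:": ["With x", "Shipped to x 01/04/16. / Shipped to y - 09/08/18."]}
)

# One row per match, indexed by (original row label, match number).
long_form = df["Shipping to:"].str.extractall(r"(\d{1,2}/\d{1,2}/\d{2})")
print(long_form)

# unstack() moves the match number into the columns,
# giving one row per original row that had at least one match.
wide_form = long_form.unstack()
print(wide_form)
```

Rows without any match (here "With x") do not appear in either result; they come back as NaN only once the result is joined to the original DataFrame.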

Testing with the following DataFrame (I know it's nonsense, but it's just a test):

df = pd.DataFrame([['1/22/33'], ['2/33/44  aaa 22/112/3 gook'], ['22/4/55'], [''], [None], ['aaa 22/5/66 aa 11/22/33']], columns=['Shipping to:'])

I get this output:

match   0   1
0   1/22/33     NaN
1   2/33/44     NaN
2   22/4/55     NaN
5   22/5/66     11/22/33

To include the x/y/z at the start, change the regex to r'([xyz] \d{1,2}/\d{1,2}/\d{2})'. Finally, if you want to add these matches as new columns to your original dataframe, you can use join. The code then becomes:

df.join(df['Shipping to:'].str.extractall(r'([xyz] \d{1,2}/\d{1,2}/\d{2})')\
    .unstack()[0])

Note that I select column 0 after calling unstack; this removes one level of the column MultiIndex and prevents join from complaining. And just because I was happily playing around with this, I added some code to fix the column names so they match your example:

df.join(df['Shipping to:'].str.extractall(r'([xyz] \d{1,2}/\d{1,2}/\d{2})')\
    .unstack()[0]\
    .rename(columns=lambda x: "Occurrence " + str(x + 1)))
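
One caveat: the pattern above requires exactly one space between the letter and the date, so it finds nothing in "With x" (no date) or in "Shipped to y - 09/08/18" (a dash in between). A possible variation, offered as a sketch rather than a definitive fix, makes the date optional and allows spaces or dashes as separators. The group names dest and date and the separator set [\s-] are assumptions based only on the sample rows in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "Unit ID": [90, 116, 265],
    "Shipping to:": [
        "With x",
        "Shipped to x 01/04/16. / Shipped to y - 09/08/18.",
        "Shipped to x 03/01/17 returned shipped to x 02/05/17",
    ],
})

# A standalone x, y or z, optionally followed by spaces/dashes and a date.
pattern = r"\b(?P<dest>[xyz])\b(?:[\s-]*(?P<date>\d{1,2}/\d{1,2}/\d{2}))?"

matches = df["Shipping to:"].str.extractall(pattern)
# Join destination and date; a missing date becomes an empty string.
combined = (matches["dest"] + " " + matches["date"].fillna("")).str.strip()
# Pivot match numbers into columns and label them starting at 1.
wide = combined.unstack().rename(columns=lambda i: "Occurrence " + str(i + 1))
result = df.join(wide)
print(result)
```

Note that this reports both shipments for unit 265, whereas the desired output in the question lists only the first; drop the later columns if repeats to the same destination should be ignored.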
