I have a DataFrame that looks like this:

Unit ID   Shipping to:
90        With x
91        With y
92        With z
116       Shipped to x 01/04/16. / Shipped to y - 09/08/18.
233       Shipped to z 03/01/17
265       Shipped to x 03/01/17 returned shipped to x 02/05/17
280       Shipped to x 06/01/17  Shipped to y 03/05/17 Shipped to z 12/12/17

I would like to extract all occurrences of x, y or z, together with the date that follows each one if there is one. I don't know in advance how many occurrences of x, y or z there will be, but I would like an end result that looks something like this:

 Unit ID  Occurrence 1  Occurrence 2  Occurrence 3 Shipping to:
    90    x                                        With x
    91    y                                        With y
    92    z                                        With z
    116   x 01/04/16    y 09/08/18                 Shipped to x 01/04/16. / Shipped to y - 09/08/18.
    233   z 03/01/17                               Shipped to z 03/01/17
    265   x 03/01/17                               Shipped to x 03/01/17 returned shipped to x 02/05/17
    280   x 06/01/17    y 03/05/17    z 12/12/17   Shipped to x 06/01/17  Shipped to y 03/05/17 Shipped to z 12/12/17

So far I've only managed to extract the first date that appears in each row, using this:

import re

date_col = []
for row in df['Shipping to:']:
    # re.search returns only the first match; keep its text, or None if there is no date
    match = re.search(r'\d{2}/\d{2}/\d{2}', str(row))
    date_col.append(match.group(0) if match else None)
df['dates'] = date_col
  • A "looks like" description, i.e. giving some examples, falls far short of a firm specification. Which format variants are allowed, what is the maximum number of "occurrences", what is the date order (m/d/y), etc.? So most of the work is not Python-related but lies in refining the specification. Commented Oct 22, 2018 at 12:57
  • @guidot it seems quite clear that the regex to match his dates is \d{2}/\d{2}/\d{2}, I would say that is quite specific. Commented Oct 22, 2018 at 14:18
  • @rje: and which of the \d{2} groups do you guess is the month? Wouldn't it be nice to reject, or not recognize, illegal days and months? Commented Oct 22, 2018 at 19:13
  • @guidot I would say that is not a task for regex. Use a regex to find candidates, then a parser to filter out illegal days and months. Commented Oct 22, 2018 at 19:16
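
The two-step approach suggested in the last comment (regex to find candidates, then a parser to reject impossible dates) could be sketched roughly as follows. The helper name valid_dates is hypothetical, and the day-first format '%d/%m/%y' is only an assumption, since the question never states the date order.

```python
import re
from datetime import datetime

# Loose pattern: finds anything shaped like a date, valid or not.
DATE_RE = re.compile(r'\d{1,2}/\d{1,2}/\d{2}')

def valid_dates(text, fmt='%d/%m/%y'):
    """Return the date-like substrings of `text` that parse as real dates.

    `fmt` is an assumption (day first); pass '%m/%d/%y' for month first.
    """
    found = []
    for candidate in DATE_RE.findall(str(text)):
        try:
            datetime.strptime(candidate, fmt)
        except ValueError:
            continue  # shaped like a date, but e.g. month 33
        found.append(candidate)
    return found
```

Applied to a column, this would be something like df['Shipping to:'].apply(valid_dates), which yields a list of accepted dates per row.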

1 Answer

The dataframe itself has a very nice function for this:

df['Shipping to:'].str.extractall(r'(\d{1,2}/\d{1,2}/\d{2})').unstack()

Note that I changed your regex to include a group (with ()) and that I'm matching single digits as well for the month and day.

Testing with the following DataFrame (I know it's nonsense, but it's just a test):

df = pd.DataFrame([['1/22/33'], ['2/33/44  aaa 22/112/3 gook'], ['22/4/55'], [''], [None], ['aaa 22/5/66 aa 11/22/33']], columns=['Shipping to:'])

I get this output:

match        0         1
0      1/22/33       NaN
1      2/33/44       NaN
2      22/4/55       NaN
5      22/5/66  11/22/33
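
To make the shape of the intermediate results easier to see, here is a small sketch on a shortened, made-up version of the test data. It only inspects what extractall and unstack return; the variable names matches and wide are mine.

```python
import pandas as pd

df = pd.DataFrame(
    {'Shipping to:': ['1/22/33', '22/4/55', '', None, 'aaa 22/5/66 aa 11/22/33']}
)

# One row per match. The row index has two levels: the original row label
# and a level named 'match' that numbers the matches within each row.
# Rows without any match (here the empty string and None) are simply absent.
matches = df['Shipping to:'].str.extractall(r'(\d{1,2}/\d{1,2}/\d{2})')
print(matches)

# unstack() moves the 'match' level into the columns, so the columns become
# a two-level index of (capture group, match number).
wide = matches.unstack()
print(wide)

# Selecting capture group 0 leaves plain columns 0, 1, ... (one per match).
print(wide[0])
```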

To include the x/y/z at the start, change the regex to r'([xyz] \d{1,2}/\d{1,2}/\d{2})'. Finally, if you want to add these matches as new columns to your original dataframe, you can use join. The code then becomes:

df.join(df['Shipping to:'].str.extractall(r'([xyz] \d{1,2}/\d{1,2}/\d{2})')\
    .unstack()[0])

Note that I select column 0 after calling unstack. This removes one level of the column MultiIndex and prevents join from complaining. And because I was happily playing around with this, I also added some code to fix the column names so they match your example:

df.join(df['Shipping to:'].str.extractall(r'([xyz] \d{1,2}/\d{1,2}/\d{2})')\
    .unstack()[0]\
    .rename(columns=lambda x: "Occurrence " + str(x + 1)))
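
One caveat: the pattern r'([xyz] \d{1,2}/\d{1,2}/\d{2})' needs a date directly after the letter, so rows such as "With x" or "Shipped to y - 09/08/18." would not be picked up. A possible variation, only a sketch, makes the date optional and tolerates punctuation between the letter and the date. The group names dest and date, the variable names, and the choice of \W* as separator are my own assumptions, and [xyz] stands in for whatever the real destination names are.

```python
import pandas as pd

df = pd.DataFrame({
    'Unit ID': [90, 91, 92, 116, 233, 265, 280],
    'Shipping to:': [
        'With x', 'With y', 'With z',
        'Shipped to x 01/04/16. / Shipped to y - 09/08/18.',
        'Shipped to z 03/01/17',
        'Shipped to x 03/01/17 returned shipped to x 02/05/17',
        'Shipped to x 06/01/17  Shipped to y 03/05/17 Shipped to z 12/12/17',
    ],
})

# A standalone x, y or z, optionally followed by non-word characters and a date.
pattern = r'\b(?P<dest>[xyz])\b(?:\W*(?P<date>\d{1,2}/\d{1,2}/\d{2}))?'
found = df['Shipping to:'].str.extractall(pattern)

# Combine destination and date; when no date was captured, keep the letter only.
labels = found['dest'].str.cat(found['date'].fillna(''), sep=' ').str.strip()

# One column per match, numbered from 1 as in the question's example.
wide = labels.unstack('match').rename(columns=lambda i: f'Occurrence {i + 1}')
result = df.join(wide)
print(result)
```

Note that for unit 265 this reports both shipments to x, whereas the example in the question lists only the first one; if repeats should be dropped, that would need an extra deduplication step.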