I have a very large dataframe.
I want to create a new column 'result' based on two existing columns, 'userid' and 'date'.
A userid can have several records (some have only one).

import pandas as pd
import numpy as np

userid = ['1','1','22','48','48','48','393','393','555','555'] 
date = ['11/01/2016','11/02/2016','11/05/2016','11/08/2016','12/02/2016','02/12/2017','02/22/2017','02/28/2017','12/15/2016','02/28/2017'] 
df1 = pd.DataFrame({"userid": userid, "date": date})

userid  date
  1   11/01/2016
  1   11/02/2016
 22   11/05/2016
 48   11/08/2016
 48   12/02/2016
 48   02/12/2017
393   02/22/2017
393   02/28/2017
555   12/15/2016
555   02/28/2017

The new column 'result' takes one of two values:
'1': the userid has at least one record before 02/01/2017 and at least one record on or after 02/01/2017 (both conditions must be satisfied). Every row of that userid gets '1'.
'0': the conditions above are not both met. Every row of that userid gets '0'.

Example 1: userid 48 appears twice before 02/01/2017 and once after 02/01/2017. Both conditions are satisfied, so the result for every row of userid 48 should be '1'.
Example 2: userid 393 appears twice in the data, but both records are dated after 02/01/2017. The first condition fails, so the result for userid 393 should be '0'.

In this case, my output data frame will be:

userid     date   result
  1    11/01/2016   0
  1    11/02/2016   0
 22    11/05/2016   0
 48    11/08/2016   1
 48    12/02/2016   1
 48    02/12/2017   1
393    02/22/2017   0
393    02/28/2017   0
555    12/15/2016   1
555    02/28/2017   1

I have no idea what the best way to achieve this is.
Can anyone help? Thanks in advance!

1 Answer

This should do the trick:

import pandas as pd
import numpy as np
import datetime

userid = ['1','1','22','48','48','48','393','393','555','555'] 
date = ['11/01/2016','11/02/2016','11/05/2016','11/08/2016','12/02/2016','02/12/2017','02/22/2017','02/28/2017','12/15/2016','02/28/2017'] 
df1 = pd.DataFrame({"userid": userid, "date": date})

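# convert the date strings to datetime (they are month/day/year)
df1['date'] = pd.to_datetime(df1['date'], format='%m/%d/%Y')

# define threshold date
dt = datetime.datetime(2017, 2, 1)

# logic: a userid qualifies if its earliest date is before the threshold
# and its latest date is on or after the threshold
fn = lambda dates: 1 if dates.min() < dt and dates.max() >= dt else 0
res = df1.groupby('userid')['date'].agg(fn).reset_index()
res = res.rename(columns={'date': 'result'})

# attach the per-user flag to every row of that user
out = df1.merge(res, on='userid')
print(out)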

Output

userid     date   result
  1    11/01/2016   0
  1    11/02/2016   0
 22    11/05/2016   0
 48    11/08/2016   1
 48    12/02/2016   1
 48    02/12/2017   1
393    02/22/2017   0
393    02/28/2017   0
555    12/15/2016   1
555    02/28/2017   1
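
Since the question mentions a very large dataframe, a variant that avoids the Python-level lambda and the merge may be worth trying. The following is only a sketch of that alternative, not part of the approach above: it uses `groupby(...).transform` to broadcast each user's earliest and latest date back onto the rows. The names `threshold` and `grouped` are introduced here for illustration, and the month/day/year format is assumed from the sample data.

```python
import pandas as pd

userid = ['1', '1', '22', '48', '48', '48', '393', '393', '555', '555']
date = ['11/01/2016', '11/02/2016', '11/05/2016', '11/08/2016', '12/02/2016',
        '02/12/2017', '02/22/2017', '02/28/2017', '12/15/2016', '02/28/2017']
df1 = pd.DataFrame({"userid": userid, "date": date})

# Assumption: dates are month/day/year, as in the sample data.
df1['date'] = pd.to_datetime(df1['date'], format='%m/%d/%Y')

# Threshold date (illustrative name).
threshold = pd.Timestamp(2017, 2, 1)

# transform() returns one value per original row, aligned with df1's index,
# so no merge is needed afterwards.
grouped = df1.groupby('userid')['date']
has_before = grouped.transform('min') < threshold
has_on_or_after = grouped.transform('max') >= threshold

# 1 only when both conditions hold for the user, otherwise 0.
df1['result'] = (has_before & has_on_or_after).astype(int)
print(df1)
```

Because `'min'` and `'max'` are passed as strings, pandas can use its built-in group reductions rather than calling a Python function once per group, which usually matters on large inputs. The row order and index of `df1` are left unchanged.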