I have a very large DataFrame.
I want to create a new column 'result' based on two existing columns, 'userid' and 'date'.
A userid can appear in one or more rows.
import pandas as pd
import numpy as np
userid = ['1','1','22','48','48','48','393','393','555','555']
date = ['11/01/2016','11/02/2016','11/05/2016','11/08/2016','12/02/2016','02/12/2017','02/22/2017','02/28/2017','12/15/2016','02/28/2017']
df1 = pd.DataFrame({"userid": userid, "date": date})
userid date
1 11/01/2016
1 11/02/2016
22 11/05/2016
48 11/08/2016
48 12/02/2016
48 02/12/2017
393 02/22/2017
393 02/28/2017
555 12/15/2016
555 02/28/2017
The new column 'result' takes one of two values:
'1': the userid has at least one record dated before 02/01/2017 and at least one record dated on or after 02/01/2017 (both conditions must be satisfied). Every row for that userid gets '1'.
'0': otherwise, every row for that userid gets '0'.
Example 1: userid 48 appears twice before 02/01/2017 and once after 02/01/2017. Both conditions are satisfied, so 'result' is '1' for every row of userid 48.
Example 2: userid 393 appears twice, but both records are dated after 02/01/2017. The first condition fails, so 'result' is '0' for every row of userid 393.
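To make the rule concrete, here is a minimal sketch of how it could be written with a grouped transform. It assumes the dates are MM/DD/YYYY strings and that integer 0/1 is acceptable for 'result'. The names parsed, cutoff, before, has_before and has_after are only illustrative. This is one possible approach, not necessarily the fastest one.

```python
import pandas as pd

userid = ['1', '1', '22', '48', '48', '48', '393', '393', '555', '555']
date = ['11/01/2016', '11/02/2016', '11/05/2016', '11/08/2016', '12/02/2016',
        '02/12/2017', '02/22/2017', '02/28/2017', '12/15/2016', '02/28/2017']
df1 = pd.DataFrame({"userid": userid, "date": date})

# Assumption: dates are month/day/year strings.
parsed = pd.to_datetime(df1["date"], format="%m/%d/%Y")
cutoff = pd.Timestamp("2017-02-01")

# True for rows dated strictly before the cutoff.
before = parsed < cutoff

# For each userid, broadcast back to every row whether the user has
# any row before the cutoff and any row on or after it.
has_before = before.groupby(df1["userid"]).transform("any")
has_after = (~before).groupby(df1["userid"]).transform("any")

# 1 only when both conditions hold for the userid, else 0.
df1["result"] = (has_before & has_after).astype(int)
print(df1)
```

The grouped transform returns one value per original row, so no merge back onto df1 is needed and no Python-level loop over users is involved.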
For the sample data above, the expected output DataFrame is:
userid date result
1 11/01/2016 0
1 11/02/2016 0
22 11/05/2016 0
48 11/08/2016 1
48 12/02/2016 1
48 02/12/2017 1
393 02/22/2017 0
393 02/28/2017 0
555 12/15/2016 1
555 02/28/2017 1
I am not sure what the best way to achieve this is on a DataFrame this large.
Can anyone help? Thanks in advance!