
My df looks like this:

sprint   sprint_created
------   -----------
S100     2020-01-01    
S101     2020-01-10
NULL     2020-01-20
NULL     2020-01-31
S101     2020-01-10
...

In the above df, you can see that some of the sprint values are NULL.

I have another df2 that has sprint date ranges:

sprint   sprint_start   sprint_end
------   -----------    ----------
S100     2020-01-01     2020-01-09    
S101     2020-01-10     2020-01-19  
S102     2020-01-20     2020-01-29  
S103     2020-01-30     2020-02-09  
S104     2020-02-10     2020-02-19  
...

How can I fill in the NULL sprint values in df by comparing each sprint_created date against the date ranges in df2?

Please note that the shapes of df and df2 are different.
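
For reference, the two frames can be reconstructed like this (a minimal sketch; the values are taken from the tables above, with dates kept as strings to be parsed later):

import pandas as pd

df = pd.DataFrame({
    'sprint': ['S100', 'S101', None, None, 'S101'],
    'sprint_created': ['2020-01-01', '2020-01-10', '2020-01-20',
                       '2020-01-31', '2020-01-10'],
})

df2 = pd.DataFrame({
    'sprint': ['S100', 'S101', 'S102', 'S103', 'S104'],
    'sprint_start': ['2020-01-01', '2020-01-10', '2020-01-20',
                     '2020-01-30', '2020-02-10'],
    'sprint_end': ['2020-01-09', '2020-01-19', '2020-01-29',
                   '2020-02-09', '2020-02-19'],
})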

  • Are both DataFrames guaranteed to use the same row-based index? i.e. row 5 in df always corresponds to row 5 in df2. Or do you need to match them based on the sprint_created and sprint_start columns? (They look the same here but may not be.) Pandas has good documentation on concatenation and merging: pandas.pydata.org/pandas-docs/stable/user_guide/merging.html Commented Jan 7, 2021 at 23:55
  • As I can see, the sprint column has multiple repeated values in the first table. What key would you use to identify and join a row if your index values don't match between the two tables? Commented Jan 7, 2021 at 23:59
  • In df2, yes, since it provides the sprint date ranges. But df could be random. Commented Jan 8, 2021 at 0:04
  • df will have multiple rows with the same sprint id, and the key in df would be another column, project_id. Commented Jan 8, 2021 at 0:05
  • Thanks for pointing that out; I updated the question. Commented Jan 8, 2021 at 0:07

2 Answers


One way would be to melt and resample your df2, then build a date-to-sprint dictionary to map back to df1 (the question's df):

import pandas as pd

# make sure the date columns are in datetime format
df1['sprint_created'] = pd.to_datetime(df1['sprint_created'])
df2['sprint_start'] = pd.to_datetime(df2['sprint_start'])
df2['sprint_end'] = pd.to_datetime(df2['sprint_end'])

# melt the two date columns into one, then resample daily within each sprint group
new = (df2.melt(id_vars='sprint').drop('variable', axis=1).set_index('value')
          .groupby('sprint', group_keys=False).resample('D').ffill().reset_index())

# build a date -> sprint dictionary and map it back onto df1
dct = dict(zip(new['value'], new['sprint']))
df1['sprint'] = df1['sprint_created'].map(dct)
#or df1['sprint'] = df1['sprint'].fillna(df1['sprint_created'].map(dct))
df1
Out[1]: 
  sprint sprint_created
0   S100     2020-01-01
1   S101     2020-01-10
2   S102     2020-01-20
3   S103     2020-01-31
4   S101     2020-01-10
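
As a side note (not part of the original answer): if materializing one row per day worries you for long date ranges, an IntervalIndex lookup is a possible alternative, assuming the ranges in df2 do not overlap:

# build closed intervals from the start/end columns (they must not overlap)
intervals = pd.IntervalIndex.from_arrays(
    df2['sprint_start'], df2['sprint_end'], closed='both')

# position of the interval containing each created date; -1 means no match
pos = intervals.get_indexer(df1['sprint_created'])

# fill only the missing sprint values; dates outside every range stay NaN
looked_up = pd.Series(df2['sprint'].to_numpy()[pos], index=df1.index).where(pos >= 0)
df1['sprint'] = df1['sprint'].fillna(looked_up)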



I assumed the duplicated sprint rows in df (the first dataframe) can be dropped; please advise otherwise if not. I use merge_asof with a one-day tolerance, based on my comparison of the two dataframes you provided.

df.assign(
    sprint=pd.merge_asof(
        df.drop_duplicates(keep='first'),
        df1,                               # the ranges frame (the question's df2)
        left_on="sprint_created",
        right_on="sprint_start",
        tolerance=pd.Timedelta("1D"),
    )['sprint_y']
).dropna()

  sprint sprint_created
0   S100     2020-01-01
1   S101     2020-01-10
2   S102     2020-01-20
3   S103     2020-01-31

If your frame legitimately has multiple rows per sprint, as explained in the comments above, please try:

g = df.assign(
    sprint=pd.merge_asof(
        df.drop_duplicates(keep='first'),
        df1,
        left_on="sprint_created",
        right_on="sprint_start",
        tolerance=pd.Timedelta("1D"),
    )['sprint_y']
)
# rows dropped as duplicates come back as NaN; forward-fill them within each created date
g.loc[g.sprint.isna(), 'sprint'] = g.groupby('sprint_created').sprint.ffill()
print(g)



sprint sprint_created
0   S100     2020-01-01
1   S101     2020-01-10
2   S102     2020-01-20
3   S103     2020-01-31
4   S101     2020-01-10
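
One caveat, as I read merge_asof (an observation, not from the answer): the default direction='backward' matches each sprint_created to the latest sprint_start at or before it, and the one-day tolerance only accepts dates falling on a sprint's start or the day after. If created dates can fall anywhere inside a sprint, dropping the tolerance covers the whole range:

# both frames must be sorted on their asof keys (df1 already is here)
mapped = pd.merge_asof(
    df.sort_values('sprint_created'),
    df1,                               # the ranges frame (the question's df2)
    left_on='sprint_created',
    right_on='sprint_start',
    direction='backward')
# both frames have a 'sprint' column, so the looked-up id lands in mapped['sprint_y']

Note that without a tolerance, dates past the last sprint_end would still match the last sprint, so add a guard if that matters for your data.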

