I'm new to Python and Pandas, and I've pulled in a database table that contains 15+ different datetime columns. My task is to sort these columns generally by earliest to latest value in the rows. However, the data is not clean; sometimes, where Column A's date would come before Column B's date in Row 0, A would come after B in Row 1.
I wrote a few functions (redacted here for simplicity) that compare two columns by calculating the percentage of times dates in A come before and after B, and then sorting the columns based on that percentage:
def get_percentage(df, df_subset):
return len(df_subset)/float(len(df))
def duration_report(df, earlier_column, later_column):
results = {}
td = df[later_column] - df[earlier_column]
results["Before"] = get_percentage(df, df.loc[td >= pd.Timedelta(0)])
results["After"] = get_percentage(df, df.loc[td <= pd.Timedelta(0)])
ind = "%s vs %s" % (earlier_column, later_column)
return pd.DataFrame(data=results, index=[ind])
def order_date_columns(df, col1, col2):
before = duration_report(df, col1, col2).Before.values[0]
after = duration_report(df, col1, col2).After.values[0]
if before >= after:
return [col1, col2]
else:
return [col2, col1]
My goal with the above code is to programmatically implement the following:
If Col A dates come before Col B dates 50+% of the time, Col A should come before Col B in the list of earliest to latest datetime columns.
The order_date_columns() function successfully sorts two columns into the correct order, but how do I apply this sorting to the 15+ columns at once? I've looked into df.apply(), lambda, and map(), but haven't been able to crack this problem.
Any help (with code clarity/efficiency, too) would be appreciated!