How can I sort datetime columns by row value in a Pandas dataframe?

Question

I'm new to Python and Pandas, and I've pulled in a database table that contains 15+ different datetime columns. My task is to sort these columns generally by earliest to latest value in the rows. However, the data is not clean; sometimes, where Column A's date would come before Column B's date in Row 0, A would come after B in Row 1.

I wrote a few functions (redacted here for simplicity) that compare two columns by calculating the percentage of times dates in A come before and after B, and then sorting the columns based on that percentage:

import pandas as pd

def get_percentage(df, df_subset):
    # Fraction of df's rows that fall in df_subset.
    return len(df_subset) / float(len(df))

def duration_report(df, earlier_column, later_column):
    results = {}
    td = df[later_column] - df[earlier_column]
    # Rows with equal dates count as "Before"; the strict "<" keeps them out of "After".
    results["Before"] = get_percentage(df, df.loc[td >= pd.Timedelta(0)])
    results["After"] = get_percentage(df, df.loc[td < pd.Timedelta(0)])
    ind = "%s vs %s" % (earlier_column, later_column)
    return pd.DataFrame(data=results, index=[ind])

def order_date_columns(df, col1, col2):
    # Build the report once and read both percentages from it.
    report = duration_report(df, col1, col2)
    before = report.Before.values[0]
    after = report.After.values[0]
    if before >= after:
        return [col1, col2]
    else:
        return [col2, col1]

My goal with the above code is to programmatically implement the following:

If Col A dates come before Col B dates 50+% of the time, Col A should come before Col B in the list of earliest to latest datetime columns.

The order_date_columns() function successfully sorts two columns into the correct order, but how do I apply this sorting to the 15+ columns at once? I've looked into df.apply(), lambda, and map(), but haven't been able to crack this problem.

Any help (with code clarity/efficiency, too) would be appreciated!

For the record, I'm using Anaconda's Python 2.7.12.

alemosie, 2016-09-27 14:39:06 +00:00

Sohier Dane · Accepted Answer · 2016-09-26 22:39:10Z

If you don't mind taking a bit of a shortcut and using the median of each date column, this should work:

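import numpy as np

def order_date_columns(df, date_columns_to_sort):
    # Pair each column with its median date, expressed as an int64 timestamp.
    medians = [(col, df[col].astype(np.int64).median()) for col in date_columns_to_sort]
    return [col for col, _ in sorted(medians, key=lambda pair: pair[1])]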
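
As a rough illustration of the median shortcut, here is a minimal sketch on made-up data. It assumes a reasonably recent pandas in which Series.median() accepts datetime columns and skips missing dates (NaT) by default. The function name order_date_columns_by_median and the column names are hypothetical, not from the question.

```python
import pandas as pd


def order_date_columns_by_median(df, date_columns_to_sort):
    # Series.median() ignores NaT by default (skipna=True), so a column
    # with missing dates is ranked by the dates it does have.
    medians = {col: df[col].median() for col in date_columns_to_sort}
    return sorted(date_columns_to_sort, key=medians.get)


# Hypothetical sample data: "closed" has one missing date.
df = pd.DataFrame({
    "closed": pd.to_datetime(["2016-01-09", None, "2016-03-20"]),
    "opened": pd.to_datetime(["2016-01-01", "2016-02-08", "2016-03-01"]),
    "reviewed": pd.to_datetime(["2016-01-05", "2016-02-09", "2016-03-09"]),
})

print(order_date_columns_by_median(df, ["closed", "opened", "reviewed"]))
```

Because the median summarizes each column on its own, it never compares dates within the same row, so it can disagree with a row-by-row majority rule on messy data.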


alemosie Over a year ago

Thanks for this! I applied it to the data and the output looks mostly correct, but tests a few assumptions of mine; I'll have to investigate further. The dates in this data are pretty funky, hence why I was more comfortable with a step-by-step, percentage-based approach. I'd still love to know how to do this the "long" way -- more as a practice to apply to similar problems in the future!

mfsiega · Accepted Answer · 2016-09-27 16:29:05Z

Since you're using Python 2.7, you can use the cmp keyword argument to sorted. To get the column names in the order that you're looking for, I would do something like:

# Returns 1 if first_column[i] > second_column[i] more often (first_column sorts later).
# Returns -1 if vice versa (first_column sorts earlier).
# Returns 0 if equal.
# Assumes df[first_column] and df[second_column] are the same length.
def compare_two(first_column, second_column):
    c1_greater_count = 0
    c2_greater_count = 0
    # Iterate over the two columns in the dataframe. df must be in accessible scope.
    for i in range(len(df[first_column])):
        if df[first_column].iloc[i] > df[second_column].iloc[i]:
            c1_greater_count += 1
        elif df[second_column].iloc[i] > df[first_column].iloc[i]:
            c2_greater_count += 1

    # A positive result tells sorted() to place first_column after second_column.
    if c1_greater_count > c2_greater_count:
        return 1
    if c2_greater_count > c1_greater_count:
        return -1
    return 0

df = get_dataframe_from_somewhere()
relevant_column_names = get_relevant_column_names(df) # e.g., get all the dates.
sorted_column_names = sorted(relevant_column_names, cmp=compare_two)

# sorted_column_names holds the names of the relevant columns,
# sorted according to the given ordering.

I'm sure there's a more Pythonic way to do it, but this should work. Note that Python 3 removed the cmp argument from sorted; there, wrap the comparison function with functools.cmp_to_key and pass the result as key.
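
For Python 3, the same idea might look like the sketch below. It is not the answer's original code: the helper name make_column_comparator and the sample columns are made up for illustration, and the row-by-row loop is replaced with vectorized comparisons. One caveat: a "more often than not" rule is not guaranteed to be transitive (A can beat B, B beat C, and C beat A), in which case the resulting order depends on the sort's internal comparisons.

```python
from functools import cmp_to_key

import pandas as pd


def make_column_comparator(df):
    """Build a cmp-style function that orders columns by majority vote."""
    def compare_two(first_column, second_column):
        first, second = df[first_column], df[second_column]
        # Comparisons involving NaT evaluate to False, so rows with a
        # missing date in either column count for neither side.
        first_later = int((first > second).sum())
        second_later = int((second > first).sum())
        # Positive result: first_column sorts after second_column.
        return (first_later > second_later) - (first_later < second_later)
    return compare_two


# Hypothetical sample data: each pair of columns disagrees in one row.
df = pd.DataFrame({
    "shipped": pd.to_datetime(["2016-01-05", "2016-02-07", "2016-03-09"]),
    "ordered": pd.to_datetime(["2016-01-01", "2016-02-08", "2016-03-01"]),
    "delivered": pd.to_datetime(["2016-01-09", "2016-02-10", "2016-03-08"]),
})

sorted_column_names = sorted(df.columns, key=cmp_to_key(make_column_comparator(df)))
print(sorted_column_names)
```

Passing the dataframe into a factory function avoids relying on a global df, which the version above requires.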
