2

I'm new to Python and Pandas, and I've pulled in a database table that contains 15+ different datetime columns. My task is to sort these columns generally by earliest to latest value in the rows. However, the data is not clean; sometimes, where Column A's date would come before Column B's date in Row 0, A would come after B in Row 1.

I wrote a few functions (redacted here for simplicity) that compare two columns by calculating the percentage of times dates in A come before and after B, and then sorting the columns based on that percentage:

def get_percentage(df, df_subset):
    return len(df_subset)/float(len(df))    

def duration_report(df, earlier_column, later_column):   
    results = {}
    td = df[later_column] - df[earlier_column]
    results["Before"] = get_percentage(df, df.loc[td >= pd.Timedelta(0)])
    results["After"] = get_percentage(df, df.loc[td <= pd.Timedelta(0)])
    ind = "%s vs %s" % (earlier_column, later_column)
    return pd.DataFrame(data=results, index=[ind])

def order_date_columns(df, col1, col2):
    before = duration_report(df, col1, col2).Before.values[0]
    after = duration_report(df, col1, col2).After.values[0]
    if before >= after:
        return [col1, col2]
    else:
        return [col2, col1]

My goal with the above code is to programmatically implement the following:

If Col A dates come before Col B dates 50+% of the time, Col A should come before Col B in the list of earliest to latest datetime columns.

The order_date_columns() function successfully sorts two columns into the correct order, but how do I apply this sorting to the 15+ columns at once? I've looked into df.apply(), lambda, and map(), but haven't been able to crack this problem.

Any help (with code clarity/efficiency, too) would be appreciated!

1
  • For the record, I'm using Anaconda's Python 2.7.12. Commented Sep 27, 2016 at 14:39

2 Answers 2

2

If you don't mind taking a bit of a shortcut and using the median of each date column, this should work:

def order_date_columns(df, date_columns_to_sort):
    x = [(col, df[col].astype(np.int64).median()) for col in date_columns_to_sort]
    return [x[0] for x in sorted(x, key=lambda x: x[1])]
Sign up to request clarification or add additional context in comments.

1 Comment

Thanks for this! I applied it to the data and the output looks mostly correct, but tests a few assumptions of mine; I'll have to investigate further. The dates in this data are pretty funky, hence why I was more comfortable with a step-by-step, percentage-based approach. I'd still love to know how to do this the "long" way -- more as a practice to apply to similar problems in the future!
1

Since you're using Python 2.7, you can use the cmp keyword argument to sorted. To get the column names in the order that you're looking for, I would do something like:

# Returns -1 if first_column[i] > second_column[i] more often.
# Returns 1 if vice versa.
# Returns 0 if equal.
# Assumes df[first_column] and df[second_column] are the same length.
def compare_two(first_column, second_column):
    c1_greater_count = 0
    c2_greater_count = 0
    # Iterate over the two columns in the dataframe. df must be in accessible scope.
    for i in range(len(df[first_column])):
        if df[first_column].iloc(i) > df[second_column].iloc[i]:
            c1_greater_count += 1
        elif df[second_column].iloc[i] > df[first_column].iloc[i]:
            c2_greater_count += 1

    if c1_greater_count > c2_greater_count:
        return -1
    if c2_greater_count > c1_greater_count:
        return 1
    return 0

df = get_dataframe_from_somewhere()
relevant_column_names = get_relevant_column_names(df) # e.g., get all the dates.
sorted_column_names = sorted(relevant_column_names, cmp=compare_two)

# sorted_column_names holds the names of the relevant columns,
# sorted according to the given ordering.

I'm sure there's a more Pythonic way to do it, but this should work. Note that for Python 3, you can use the cmp_to_key utility.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.