Pandas sort_values seems to sort list but getting similar errors?

Question

Using Python 3.9 with PyCharm on a Mac. I'm loading a .csv file containing month-long slices of dates/times and values. All data is read in as strings. Each slice is sorted internally, but the slices are not in order relative to one another. Instead of "123 456 789" it is "321 654 987", as can be seen here:

time,value
2019-12-11 10:00:00,156
2019-12-11 09:00:00,156
2020-02-07 20:00:00,149.5
2020-02-07 19:00:00,149.8

To remedy this, I first convert the time column to datetime using df['time'] = pd.to_datetime(df['time']). I then sort the values using df.sort_values(by="time", inplace=True, ascending=True). The resulting list appears correct (even when I graph it), but I wrote an error finder that compares each value to the previous one and counts how many times the pair is out of order:

error1 = 0
error2 = 0

# Count the flagged pairs before sorting.
for i in range(1, len(df)):
    if df['time'][i] > df['time'][i - 1]:
        error1 += 1

df.sort_values(by="time", inplace=True, ascending=True)

# Count the flagged pairs again after sorting.
for i in range(1, len(df)):
    if df['time'][i] > df['time'][i - 1]:
        error2 += 1

print(error1 == error2)

Output: True

The post-sort loop should flag either 0% or 100% of values (depending on ascending/descending), but it still flags the same errors as the pre-sort loop. Oddly, the values in the post-sort DataFrame look correctly sorted.

I tried:

I confirmed that I was not sorting strings and that I was including "inplace=True", which are the two common causes in sort_values questions on SO.
I also tried assigning the result to a different variable: df_sorted = df.sort_values(by="time"). However, this gave the same result.

df.sort_values does not change the index, so df['time'][_] returns the same value for both the sorted and the unsorted DataFrame. There are better options: df['time'].diff().gt(0).sum().
– Quang Hoang, Commented Nov 30, 2021 at 16:48
@QuangHoang The index seems to be the issue here! I thought that adding "inplace=True" would be sufficient. However, adding ignore_index=True and another line with df.reset_index(drop=True, inplace=True) had the desired effect! Note I meant error2 = 0, but you are also correct about the .diff() method; I will implement that.
– Andrew M, Commented Nov 30, 2021 at 17:27
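
To illustrate the point made in the comments, here is a minimal sketch on a four-row sample built from the rows shown in the question. It shows that the rows move during the sort but keep their original index labels, so a lookup such as df['time'][0] still returns the row that was first before sorting. The order check compares against pd.Timedelta(0) rather than the integer 0 from the comment; that substitution is an assumption made here for a datetime column, and the variable names are illustrative only.

```python
import pandas as pd

# Small sample in the same "slices in reverse order" layout as the question.
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2019-12-11 10:00:00",
        "2019-12-11 09:00:00",
        "2020-02-07 20:00:00",
        "2020-02-07 19:00:00",
    ]),
    "value": [156.0, 156.0, 149.5, 149.8],
})

sorted_df = df.sort_values(by="time")

# The rows are reordered, but each row keeps its original index label.
index_after_sort = sorted_df.index.tolist()

# Label lookup: label 0 still refers to the 10:00 row.
by_label = sorted_df["time"][0]
# Positional lookup: the first row in the sorted order.
by_position = sorted_df["time"].iloc[0]

# Order checks that follow the current row order and ignore the labels.
backward_steps = int(sorted_df["time"].diff().lt(pd.Timedelta(0)).sum())
is_sorted = sorted_df["time"].is_monotonic_increasing

print(index_after_sort, by_label, by_position, backward_steps, is_sorted)
```

Using .iloc (or a vectorised check such as .diff()) avoids depending on the index labels at all, which is why it gives the expected result even without resetting the index.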

Andrew M · Accepted Answer · 2021-12-01 15:23:21Z


@QuangHoang has the correct answer in the comments.

Expanded: Although df.sort_values(by="time", inplace=True, ascending=True) reorders the rows by the specified column, it does not change the index: every row keeps its original label. To relabel the rows, I added ignore_index=True and another line, df.reset_index(drop=True, inplace=True). Either one is enough on its own, since both replace the index with a fresh 0 to n-1 range. The full code:

df.sort_values(by="time", inplace=True, ascending=True, ignore_index=True)
df.reset_index(drop=True, inplace=True)
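
As a rough check of the fix, the sketch below reruns the question's loop on a four-row sample taken from the rows in the question. The helper count_increasing_steps and the variable names are hypothetical, introduced only for this illustration. It compares the count on the unsorted data, on sorted data that keeps its old labels, and on sorted data with a fresh index.

```python
import pandas as pd

df = pd.DataFrame({
    "time": pd.to_datetime([
        "2019-12-11 10:00:00",
        "2019-12-11 09:00:00",
        "2020-02-07 20:00:00",
        "2020-02-07 19:00:00",
    ]),
    "value": [156.0, 156.0, 149.5, 149.8],
})


def count_increasing_steps(frame):
    """Mirror the question's loop: compare label i with label i - 1."""
    count = 0
    for i in range(1, len(frame)):
        if frame["time"][i] > frame["time"][i - 1]:
            count += 1
    return count


# Sorted, but every row keeps its original label.
kept_labels = df.sort_values(by="time")
# Two ways of getting a fresh 0..n-1 index after sorting.
via_ignore_index = df.sort_values(by="time", ignore_index=True)
via_reset_index = df.sort_values(by="time").reset_index(drop=True)

before = count_increasing_steps(df)
after_kept = count_increasing_steps(kept_labels)
after_relabel = count_increasing_steps(via_ignore_index)

print(before, after_kept, after_relabel)
```

On this sample the count is unchanged when the labels are kept, and it reaches len(df) - 1 once the index is rebuilt, which matches the "0% or 100%" expectation in the question. Note that ignore_index requires pandas 1.0 or later.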

As a last check when saving the file, I added index=True, which writes the index when exporting to CSV. (It is the default for to_csv, but spelling it out makes the intent explicit.) That way you can see directly what any manipulation does to the values and the indices.

df.to_csv('fname', index=True)
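
A small sketch of that troubleshooting step, assuming an in-memory buffer (io.StringIO) in place of a real file so nothing is written to disk. The two-row sample and its values are made up for illustration. The first column of the output is the index, so labels that are out of order after a sort are visible at a glance.

```python
import io

import pandas as pd

# Made-up two-row sample; the second row is the earlier one.
df = pd.DataFrame({
    "time": pd.to_datetime(["2019-12-11 10:00:00", "2019-12-11 09:00:00"]),
    "value": [156.0, 155.0],
})

# Sort without resetting the index, then export with the index included.
sorted_df = df.sort_values(by="time")
buffer = io.StringIO()
sorted_df.to_csv(buffer, index=True)

lines = buffer.getvalue().splitlines()
for line in lines:
    print(line)
```

The header line starts with an empty field for the unnamed index, and the data rows begin with the labels 1 and then 0, showing that the sort reordered the rows without relabelling them.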
