2

Using Python 3.9 with Pycharm on mac. I'm loading a .csv containing slices of months of dates/times and values. All data is in strings. Each slice is internally sorted, but the combined list of slices are not. Instead of "123 456 789" it is "321 654 876", as can be seen here:

                time,value
2019-12-11 10:00:00,156
2019-12-11 09:00:00,156
2020-02-07 20:00:00,149.5
2020-02-07 19:00:00,149.8

To remedy this I first convert time column into datetime using df['time'] = pd.to_datetime(df['time']). I then sort the values using df.sort_values(by="time", inplace=True, ascending=True). The resulting list appears correct (even if I graph it), but I tried to create an error finder that compares each value to the last and flags how many times it is out of order:

error1 = 0
error2 = 0
   for _ in range(len(df)):
        if _ > 0:
            if df['time'][_] > df['time'][_ - 1]:
                error1 += 1

df.sort_values(by="time", inplace=True, ascending=True)

for _ in range(len(df)):
    if _ > 0:
        if df['time'][_] > df['time'][_ - 1]:
            x = df['time'][_]
            y = df['time'][_] - df['time'][_ - 1]
            error2 += 1

print(error1 == error2)

Output: True

The post-sort loop should flag either 0% or 100% of values (depending on ascending/descending), but it still flags the same errors as the pre-sort loop. Oddly, the corresponding values on the post-sort list look like they sorted appropriately.

I tried:

  • I confirmed that I was not sorting strings and I was including "inplace=True", which are two common sort_value SO questions
  • I also tried assigning to a different variable: df_sorted = df.sort_values(by="time") However this gave the same result.
2
  • 1
    df.sort_values does not change the index. df['time'][_] are the same for both sorted and unsorted dataframe. You have better options: df['time'].diff().gt(0).sum(). Commented Nov 30, 2021 at 16:48
  • @QuangHoang The index seems to be the issue here! I thought that adding "inplace=True" would be sufficient. However, adding ignore_index=True and another line with df.reset_index(drop=True, inplace=True) had the desired effect! Note I meant error2 = 0, but you are also correct about the .diff() method; I will implement that. Commented Nov 30, 2021 at 17:27

1 Answer 1

1

@QuangHoang has the correct answer in the comments.

Expanded: Although df.sort_values(by="time", inplace=True, ascending=True) has the effect of re-sorting the values in the column specified, it does not change the index. To do this, I added ignore_index=True and another line df.reset_index(drop=True, inplace=True), making the full code:

df.sort_values(by="time", inplace=True, ascending=True, ignore_index=True)
df.reset_index(drop=True, inplace=True)

As a last check when saving the file, I added index=True which saves the index when exporting as CSV. That way you can directly troubleshoot what is happening to the values and indices with any manipulation.

df.to_csv(f'fname', index=True)
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.