Using Python 3.9 with Pycharm on mac. I'm loading a .csv containing slices of months of dates/times and values. All data is in strings. Each slice is internally sorted, but the combined list of slices are not. Instead of "123 456 789" it is "321 654 876", as can be seen here:
time,value
2019-12-11 10:00:00,156
2019-12-11 09:00:00,156
2020-02-07 20:00:00,149.5
2020-02-07 19:00:00,149.8
To remedy this I first convert time column into datetime using df['time'] = pd.to_datetime(df['time']). I then sort the values using df.sort_values(by="time", inplace=True, ascending=True). The resulting list appears correct (even if I graph it), but I tried to create an error finder that compares each value to the last and flags how many times it is out of order:
error1 = 0
error2 = 0
for _ in range(len(df)):
if _ > 0:
if df['time'][_] > df['time'][_ - 1]:
error1 += 1
df.sort_values(by="time", inplace=True, ascending=True)
for _ in range(len(df)):
if _ > 0:
if df['time'][_] > df['time'][_ - 1]:
x = df['time'][_]
y = df['time'][_] - df['time'][_ - 1]
error2 += 1
print(error1 == error2)
Output: True
The post-sort loop should flag either 0% or 100% of values (depending on ascending/descending), but it still flags the same errors as the pre-sort loop. Oddly, the corresponding values on the post-sort list look like they sorted appropriately.
I tried:
- I confirmed that I was not sorting strings and I was including "inplace=True", which are two common sort_value SO questions
- I also tried assigning to a different variable:
df_sorted = df.sort_values(by="time")However this gave the same result.
df.sort_valuesdoes not change the index.df['time'][_]are the same for both sorted and unsorted dataframe. You have better options:df['time'].diff().gt(0).sum()."inplace=True"would be sufficient. However, addingignore_index=Trueand another line withdf.reset_index(drop=True, inplace=True)had the desired effect! Note I meant error2 = 0, but you are also correct about the .diff() method; I will implement that.