Sorting column of numerical string data in pandas

Question

I have a column of about 100 values; mixed integers and decimals (eg 27, 27.2, 28) but they're stored as datatype string (eg '27', '27.2', '28'). The data was compiled from multiple sources and some of those compiling the data did not have the precision necessary for the decimal values and so entered data with '>' or "<' characters. So add a '>27' to the example column:

col_1
27
>27
27.2
28

The values are out of sort and I would like to sort them from lowest to highest and convert them back to datatype string. My solution is to convert everything to a float, sort on the numerical values and then convert everything back but the values without precision are getting in the way. My thinking was to add characters to the end of those values, say '.001', and removing the ">" and "<" characters before converting, sorting, and then converting everything back. So, when doing the operation to add '0.001' to the string value I do this:

df['col_1'].loc[df['col_1].str.contains('>')] = df['col_1'].loc[df['col_1].str.contains('>')] + '.001'

Is there a better, more acceptable, or maybe more efficient way to do this?

user17242583 · Accepted Answer · 2021-11-23 17:43:29Z

2

diff = df['col_1'].str.contains(r'<').map({True:-0.001})
diff.update(df['col_1'].str.contains(r'>').map({True:0.001}))
diff = diff.fillna(0)

df['col_1'] = df['col_1'].str.replace(r'[^0-9.eE-]', '', regex=True).astype(float) + diff
df = df.sort_values('col_1', ascending=False)

Output:

>>> df
  col_1
0   <26
1    27
2   >27
3  27.2
4    28

>>> diff = df['col_1'].str.contains(r'<').map({True:-0.001})
>>> diff.update(df['col_1'].str.contains(r'>').map({True:0.001}))
>>> diff = diff.fillna(0)
>>> diff
0   -0.001
1    0.000
2    0.001
3    0.000
4    0.000
Name: col_1, dtype: float64

>>> df['col_1'] = df['col_1'].str.replace(r'[^0-9.eE-]', '', regex=True).astype(float) + diff
>>> df = df.sort_values('col_1', ascending=False)
>>> df
    col_1
0  25.999
1  27.000
2  27.001
3  27.200
4  28.000

answered Nov 23, 2021 at 17:43

user17242583

Sign up to request clarification or add additional context in comments.

3 Comments

Dan Over a year ago

Oh man, these two answers are great. I don't know which one to check mark. Thanks!

user17242583 Over a year ago

You should check this one :) It's vectorized, meaning it will be faster. :)

Dan Over a year ago

You're working hard for that check. I might just have to give it to you. lol

Shubham Sharma · Accepted Answer · 2021-11-23 18:50:18Z

1

Here is one way to approach the problem

s = df['col_1'].str.extract(r'(<|>)?(.*)')
s[0], s[1] = s[0].fillna('='), s[1].astype(float)

df_new = df.loc[s.sort_values([1, 0]).index]

Step by step details

Extract symbols and numbers

>>> s

     0     1
0  NaN    27
1    >    27
2  NaN  27.2
3  NaN    28
4    <    27

Fill the NaN values in column 0 with =, since < < = < >. Then change the dtype of column 1 to float

Sort the above dataframe and get the sorted index, then use the sorted index to sort the given dataframe

>>> df_new

  col_1
4   <27
0    27
1   >27
2  27.2
3    28

answered Nov 23, 2021 at 18:50

Shubham Sharma

71.8k6 gold badges26 silver badges58 bronze badges

5 Comments

user17242583 Over a year ago

You might want s[0].map({'>': 0.001, '<': -0.001}).fillna(0)

Shubham Sharma Over a year ago

Thanks @user17242583 map would also work, but in general a single call to fillna would be slightly more efficient than the map + fillna ;-)

user17242583 Over a year ago

But the OP wants to add 0.001 for >, right?

Shubham Sharma Over a year ago

Not sure about that, but I guess the OP is only interested in sorting the dataframe.

Dan Over a year ago

Yeah. The value doesn’t matter in as much as the only possible decimals in the data have one digit. I’m only adding (or subtracting) .001 to place the “<“ or “>” just above or below the numeric value in the list. I could use .01, .001, or even .000001. It wouldn’t matter. I’m only adding it to sort and then removing it and putting the “<“ or “>” back.

ListenSoftware Louise Ai Agent · Accepted Answer · 2021-11-23 17:37:48Z

1

use an apply and a regular expression to transform the string data into a float for sorting.

col_1=['27','>27','27.2','28']
df=pd.DataFrame({'col_1':col_1})

df['sort']=df['col_1'].apply(lambda x: float(re.sub('>','',x))+0.001 if '>' in x else float(x))

print(df)

output

  col_1    sort
0    27  27.000
1   >27  27.001
2  27.2  27.200
3    28  28.000

answered Nov 23, 2021 at 17:37

ListenSoftware Louise Ai Agent

4,3432 gold badges31 silver badges39 bronze badges

2 Comments

Dan Over a year ago

Thank you. This isn't a consideration with my frame because at most it will have 100 to 200 rows and 20-50 columns. I do have other frames that I will be adding this operation to that are much larger though so thank you for that insight. I appreciate it.

ListenSoftware Louise Ai Agent Over a year ago

readibility is more important than speed for small datasets

Corralien · Accepted Answer · 2021-11-23 20:16:19Z

1

Use pd.eval. Replace '>' and '<' by '0.001+' and '-0.001+' then evaluate the expression to convert as float numbers. Finally, use sort_values and reindex to sort your original dataframe.

Setup an example:

df = pd.DataFrame({'col_1': ['>27', '28', '27', '<27', '27.2'], 
                   'col_2': range(11, 16)})
print(df)

# Output:
  col_1  col_2
0   >27     11
1    28     12
2    27     13
3   <27     14
4  27.2     15

out = df.reindex(df['col_1'].replace({'>': '0.001+', '<': '-0.001+'}, regex=True) \
                            .apply(pd.eval).sort_values().index)
print(out)

# Output:
  col_1  col_2
3   <27     14
2    27     13
0   >27     11
4  27.2     15
1    28     12

answered Nov 23, 2021 at 20:16

Corralien

121k8 gold badges43 silver badges68 bronze badges

Collectives™ on Stack Overflow

Sorting column of numerical string data in pandas

4 Answers 4

3 Comments

5 Comments

2 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

3 Comments

5 Comments

2 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related