0

I have a column of about 100 values; mixed integers and decimals (eg 27, 27.2, 28) but they're stored as datatype string (eg '27', '27.2', '28'). The data was compiled from multiple sources and some of those compiling the data did not have the precision necessary for the decimal values and so entered data with '>' or "<' characters. So add a '>27' to the example column:

col_1
27
>27
27.2
28

The values are out of sort and I would like to sort them from lowest to highest and convert them back to datatype string. My solution is to convert everything to a float, sort on the numerical values and then convert everything back but the values without precision are getting in the way. My thinking was to add characters to the end of those values, say '.001', and removing the ">" and "<" characters before converting, sorting, and then converting everything back. So, when doing the operation to add '0.001' to the string value I do this:

df['col_1'].loc[df['col_1].str.contains('>')] = df['col_1'].loc[df['col_1].str.contains('>')] + '.001'

Is there a better, more acceptable, or maybe more efficient way to do this?

4 Answers 4

2
diff = df['col_1'].str.contains(r'<').map({True:-0.001})
diff.update(df['col_1'].str.contains(r'>').map({True:0.001}))
diff = diff.fillna(0)

df['col_1'] = df['col_1'].str.replace(r'[^0-9.eE-]', '', regex=True).astype(float) + diff
df = df.sort_values('col_1', ascending=False)

Output:

>>> df
  col_1
0   <26
1    27
2   >27
3  27.2
4    28

>>> diff = df['col_1'].str.contains(r'<').map({True:-0.001})
>>> diff.update(df['col_1'].str.contains(r'>').map({True:0.001}))
>>> diff = diff.fillna(0)
>>> diff
0   -0.001
1    0.000
2    0.001
3    0.000
4    0.000
Name: col_1, dtype: float64

>>> df['col_1'] = df['col_1'].str.replace(r'[^0-9.eE-]', '', regex=True).astype(float) + diff
>>> df = df.sort_values('col_1', ascending=False)
>>> df
    col_1
0  25.999
1  27.000
2  27.001
3  27.200
4  28.000
Sign up to request clarification or add additional context in comments.

3 Comments

Oh man, these two answers are great. I don't know which one to check mark. Thanks!
You should check this one :) It's vectorized, meaning it will be faster. :)
You're working hard for that check. I might just have to give it to you. lol
1

Here is one way to approach the problem

s = df['col_1'].str.extract(r'(<|>)?(.*)')
s[0], s[1] = s[0].fillna('='), s[1].astype(float)

df_new = df.loc[s.sort_values([1, 0]).index]

Step by step details

  1. Extract symbols and numbers
>>> s

     0     1
0  NaN    27
1    >    27
2  NaN  27.2
3  NaN    28
4    <    27
  1. Fill the NaN values in column 0 with =, since < < = < >. Then change the dtype of column 1 to float
>>> s

   0     1
0  =  27.0
1  >  27.0
2  =  27.2
3  =  28.0
4  <  27.0
  1. Sort the above dataframe and get the sorted index, then use the sorted index to sort the given dataframe
>>> df_new

  col_1
4   <27
0    27
1   >27
2  27.2
3    28

5 Comments

You might want s[0].map({'>': 0.001, '<': -0.001}).fillna(0)
Thanks @user17242583 map would also work, but in general a single call to fillna would be slightly more efficient than the map + fillna ;-)
But the OP wants to add 0.001 for >, right?
Not sure about that, but I guess the OP is only interested in sorting the dataframe.
Yeah. The value doesn’t matter in as much as the only possible decimals in the data have one digit. I’m only adding (or subtracting) .001 to place the “<“ or “>” just above or below the numeric value in the list. I could use .01, .001, or even .000001. It wouldn’t matter. I’m only adding it to sort and then removing it and putting the “<“ or “>” back.
1

use an apply and a regular expression to transform the string data into a float for sorting.

col_1=['27','>27','27.2','28']
df=pd.DataFrame({'col_1':col_1})

df['sort']=df['col_1'].apply(lambda x: float(re.sub('>','',x))+0.001 if '>' in x else float(x))

print(df)

output

  col_1    sort
0    27  27.000
1   >27  27.001
2  27.2  27.200
3    28  28.000

2 Comments

Thank you. This isn't a consideration with my frame because at most it will have 100 to 200 rows and 20-50 columns. I do have other frames that I will be adding this operation to that are much larger though so thank you for that insight. I appreciate it.
readibility is more important than speed for small datasets
1

Use pd.eval. Replace '>' and '<' by '0.001+' and '-0.001+' then evaluate the expression to convert as float numbers. Finally, use sort_values and reindex to sort your original dataframe.

Setup an example:

df = pd.DataFrame({'col_1': ['>27', '28', '27', '<27', '27.2'], 
                   'col_2': range(11, 16)})
print(df)

# Output:
  col_1  col_2
0   >27     11
1    28     12
2    27     13
3   <27     14
4  27.2     15
out = df.reindex(df['col_1'].replace({'>': '0.001+', '<': '-0.001+'}, regex=True) \
                            .apply(pd.eval).sort_values().index)
print(out)

# Output:
  col_1  col_2
3   <27     14
2    27     13
0   >27     11
4  27.2     15
1    28     12

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.