Simple Question Here:

b = 8143.1795845088482
d = 14723.523658084257

My Df called final:

Words       score
This      90374.98788
is        80559.4495
a         43269.67002
sample    34535.01172
output    Very Low

I want to replace all the scores with either 'very low', 'low', 'medium', or 'high' based on which quartile range they fall into.

something like this works:

final['score'][final['score'] <= b] = 'Very Low' #This is shown in the example above 

but when I try to apply this immediately afterwards, it doesn't work:

final['score'][final['score'] >= b] and final['score'][final['score'] <= d] = 'Low'

This gives me the error: cannot assign operator. Anyone know what I am missing?

2 Answers

First, you must use the bitwise operators (& and | instead of and and or) because you are comparing arrays, that is, every value rather than a single one. Taking the truth value of a whole array is ambiguous, and you cannot override the built-in and operator to behave the way you want. Second, you must put parentheses around each condition because of operator precedence: & binds more tightly than comparisons such as >=.

Finally, you are performing chained indexing, which may or may not work and will raise a warning. To set your column values, use loc like this:

In [4]:

b = 25 
d = 50
final.loc[(final['score'] >= b) & (final['score'] <= d), 'score'] = 'Low'
final
Out[4]:
  Words score
0  This    10
1    is   Low
2   for   Low
3   You   704
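
To see why and fails where & works, here is a small self-contained sketch. The frame contents and the thresholds b and d are made-up values for illustration, and writing the result to a separate label column is my assumption rather than part of the code above; it just keeps score numeric so later comparisons still work.

```python
import pandas as pd

# Hypothetical data and thresholds, for illustration only.
final = pd.DataFrame({"Words": ["This", "is", "for", "You"],
                      "score": [10.0, 30.0, 45.0, 704.0]})
b, d = 25, 50

# `and` asks each Series for one truth value, which pandas refuses to guess.
try:
    (final["score"] >= b) and (final["score"] <= d)
    and_error = None
except ValueError as exc:
    and_error = exc  # "The truth value of a Series is ambiguous..."

# `&` combines the two boolean masks element by element.
# The parentheses are required because & binds tighter than >= and <=.
mask = (final["score"] >= b) & (final["score"] <= d)

# One .loc call selects rows and column together, so there is no chained indexing.
# "label" is a new, hypothetical column; rows outside the mask are left as NaN.
final.loc[mask, "label"] = "Low"
print(final)
```

Only the rows where both conditions hold get the label, and the score column keeps its float dtype.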

5 Comments

Hi Ed, this throws the following error: ValueError: Arrays were different lengths: 58 vs 1
You'll have to edit valid input data and your code for me to reproduce your error. On your data you posted my code works fine as you can see.
updated, I haven't the slightest why this error is being thrown -- OP updated
it's being caused by the first replace line and the inclusion of the values 'Very Low' in the final df...just not sure how to get around this
@user3682157 your problem here is that once you overwrite the values for the first quartile, you effectively change the dtype to a mixture of strings and ints/floats. The comparison then no longer works. It would be better to assign the string representations to a new column, like unutbu suggests

If your DataFrame's scores were all floats,

In [234]: df
Out[234]: 
    Words        score
0    This  90374.98788
1      is  80559.44950
2       a  43269.67002
3  sample  34535.01172

then you could use pd.qcut to categorize each value by its quartile:

In [236]: df['quartile'] = pd.qcut(df['score'], q=4, labels=['very low', 'low', 'medium', 'high'])

In [237]: df
Out[237]: 
    Words        score  quartile
0    This  90374.98788      high
1      is  80559.44950    medium
2       a  43269.67002       low
3  sample  34535.01172  very low
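
If you already have fixed cut points, such as b and d from the question, pd.cut takes explicit bin edges instead of computing quantiles from the data. This is only a rough sketch: the third edge, upper, is a made-up value because the question gives just two thresholds, and the bucket column name is likewise my own choice.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Words": ["This", "is", "a", "sample"],
                   "score": [90374.98788, 80559.4495, 43269.67002, 34535.01172]})

b = 8143.1795845088482
d = 14723.523658084257
upper = 50000.0  # hypothetical third cut point, not taken from the question

# Bins are right-inclusive by default: (-inf, b], (b, d], (d, upper], (upper, inf]
edges = [-np.inf, b, d, upper, np.inf]
df["bucket"] = pd.cut(df["score"], bins=edges,
                      labels=["very low", "low", "medium", "high"])
print(df)
```

With these edges the two largest scores land in the top bin and the other two in the bin below it. The labels live in their own column, so score stays a float column.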

DataFrame columns have a dtype. When the values are all floats, the column has a float dtype, which can be very fast for numerical calculations. When the values are a mixture of floats and strings, the dtype is object, which means each value is a Python object. While this gives the values a lot of flexibility, it is also very slow, since every operation ultimately falls back to calling a Python function instead of a NumPy/pandas C/Fortran/Cython function. Thus you should try to avoid mixing floats and strings in a single column.
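
A quick way to check this on your own data is to look at the dtype directly. The sketch below uses two throwaway Series (the names are mine) built from values in the question. The pd.to_numeric call at the end is an extra suggestion beyond what is described above, for the case where a column has already been mixed.

```python
import pandas as pd

numeric = pd.Series([90374.98788, 80559.4495, 43269.67002])
mixed = pd.Series([90374.98788, 80559.4495, "Very Low"])

print(numeric.dtype)  # float64
print(mixed.dtype)    # object

# A comparison on the float Series is vectorised and returns a boolean mask.
numeric_mask = numeric <= 50000

# If a column is already mixed, errors="coerce" converts what it can
# and turns the string entries into NaN, giving a float column again.
recovered = pd.to_numeric(mixed, errors="coerce")
print(recovered.dtype)  # float64
```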
