I have a pandas dataframe df with pandas.tseries.index.DatetimeIndex as index.

The data is like this:

Time                 Open  High Low   Close Volume
2007-04-01 21:02:00 1.968 2.389 1.968 2.389 18.300000
2007-04-01 21:03:00 157.140 157.140 157.140 157.140 2.400000

....

I want to replace one data point, let's say 2.389 in column Close, with NaN:

In: df["Close"].replace(2.389, np.nan)
Out: 2007-04-01 21:02:00      2.389
     2007-04-01 21:03:00    157.140

Replace did not change 2.389 to NaN. What's wrong?

2 Answers

replace might not work with floats because the floating point representation you see in the repr of the DataFrame might not be the same as the underlying float. For example, the actual Close value might be:

In [141]: df = pd.DataFrame({'Close': [2.389000000001]})

yet the repr of df looks like:

In [142]: df
Out[142]: 
   Close
0  2.389

So instead of checking for float equality, it is usually better to check for closeness:

In [150]: import numpy as np
In [151]: mask = np.isclose(df['Close'], 2.389)

In [152]: mask
Out[152]: array([ True], dtype=bool)

You can then use the boolean mask to select and change the desired values:

In [145]: df.loc[mask, 'Close'] = np.nan

In [146]: df
Out[146]: 
   Close
0    NaN
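
As a rough sketch of the same idea on data shaped like the question's, the snippet below compares an exact replace with a tolerance-based one. The stored value 2.389000000001 is an assumption carried over from the example above, not the asker's real data, and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mimicking the question: a DatetimeIndex and a Close column
# whose first value is slightly off from 2.389 (assumed, as in the answer above).
idx = pd.to_datetime(["2007-04-01 21:02:00", "2007-04-01 21:03:00"])
df = pd.DataFrame({"Close": [2.389000000001, 157.140]}, index=idx)

# Exact matching: nothing is replaced because the stored float is not 2.389.
exact = df["Close"].replace(2.389, np.nan)

# Tolerance-based matching: Series.mask sets entries to NaN where the mask is True.
close = df["Close"].mask(np.isclose(df["Close"], 2.389))

print(exact.isna().sum())
print(close.isna().tolist())
```

Unlike the df.loc assignment above, Series.mask returns a new Series and leaves df untouched, so the result has to be assigned back if the change should persist.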

3 Comments

It worked! I am a bit confused though, in the source csv it's exactly 2.389 - why did the value change after loading it into the dataframe? All I did was import like this: df = read_csv("data.csv", parse_dates=[0], infer_datetime_format=True, index_col=0)
The floating point representation of decimals is not exact. So when the CSV string "2.389" is parsed into a float, the float is not exactly 2.389; it is instead the number closest to 2.389 which is representable as a float. To see the exact value stored in a Python float, use Decimal. For example, after import decimal, decimal.Decimal(2.389) yields Decimal('2.388999999999999790389892950770445168018341064453125')
Also note that the DataFrame is storing the float in a NumPy array of dtype float32 or float64 -- a 32-bit or 64-bit float. On my (64-bit) machine, pd.DataFrame([[2.389]]).iloc[0,0] returns 2.3889999999999998.
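
To make the comments above concrete, here is a small standard-library sketch. It assumes the stored value differs from 2.389 only in its trailing digits (the 2.389000000001 figure is taken from the answer's example, not from the real CSV) and shows why an equality test fails where a tolerance test succeeds.

```python
from decimal import Decimal
import math

# Python's own parser rounds "2.389" to the nearest representable double,
# which is the same double as the literal 2.389.
stored = float("2.389")
exact = Decimal(stored)  # exact binary value held by that float

# Assumed value that differs only far past the displayed digits.
nearby = 2.389000000001

print(exact)                                       # long expansion, not 2.389
print(stored == 2.389)                             # equality holds for the same double
print(nearby == 2.389)                             # equality fails for the nearby value
print(math.isclose(nearby, 2.389, rel_tol=1e-9))   # tolerance check succeeds
```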

You need to assign the result back to df['Close'] or pass the param inplace=True: df['Close'].replace(2.389, np.nan, inplace=True)

e.g.:

In [5]:

df['Close'] = df['Close'].replace(2.389, np.nan)
df['Close']
Out[5]:
0      2.389
1    157.140
Name: Close, dtype: float64

Most pandas operations return a copy and some accept the param inplace.

Check the docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.replace.html#pandas.Series.replace
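
A minimal sketch of the copy behaviour described above, using the value 2.5 (chosen here because it is exactly representable as a float, so the exact match is guaranteed to succeed); the names original and replaced are illustrative only.

```python
import numpy as np
import pandas as pd

# 2.5 has an exact binary representation, so replace() will find it.
original = pd.Series([2.5, 157.14], name="Close")

# replace() returns a new Series; the original is left unchanged.
replaced = original.replace(2.5, np.nan)

print(original.isna().sum())     # no NaN in the original
print(replaced.isna().tolist())  # NaN only in the returned copy
```

Assigning the returned Series back, as in the example above, is what makes the change visible in the DataFrame.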

2 Comments

Unfortunately, in this case it did not work, but if replace had actually found the value, this would be the way to go.
I think unutbu's answer is the obviously correct one. Strangely, for me it worked fine without having to do anything special.
