Pandas replace values in dataframe timeseries

Question

I have a pandas dataframe df with pandas.tseries.index.DatetimeIndex as index.

The data is like this:

Time                 Open  High Low   Close Volume
2007-04-01 21:02:00 1.968 2.389 1.968 2.389 18.300000
2007-04-01 21:03:00 157.140 157.140 157.140 157.140 2.400000

....

I want to replace one datapoint, let's say 2.389 in column Close, with NaN:

In: df["Close"].replace(2.389, np.nan)
Out: 2007-04-01 21:02:00      2.389
     2007-04-01 21:03:00    157.140

Replace did not change 2.389 to NaN. What's wrong?

unutbu · Accepted Answer · 2015-01-16 20:03:37Z

6

replace might not work with floats because the floating point representation you see in the repr of the DataFrame might not be the same as the underlying float. For example, the actual Close value might be:

In [141]: df = pd.DataFrame({'Close': [2.389000000001]})

yet the repr of df looks like:

In [142]: df
Out[142]: 
   Close
0  2.389

So instead of checking for float equality, it is usually better to check for closeness:

In [150]: import numpy as np
In [151]: mask = np.isclose(df['Close'], 2.389)

In [152]: mask
Out[152]: array([ True], dtype=bool)

You can then use the boolean mask to select and change the desired values:

In [145]: df.loc[mask, 'Close'] = np.nan

In [146]: df
Out[146]: 
   Close
0    NaN
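
The same idea as a self-contained script, for anyone who wants to run it outside IPython. This is only a sketch: the two-row frame is made-up data standing in for the question's Close column, and Series.mask is used here as an alternative to the .loc assignment shown above.

```python
import numpy as np
import pandas as pd

# Made-up data: the first value is close to, but not exactly, 2.389.
df = pd.DataFrame({"Close": [2.389000000001, 157.140]})

# Exact matching finds nothing, because 2.389000000001 != 2.389.
exact = df["Close"].replace(2.389, np.nan)

# Tolerance-based matching (np.isclose defaults: rtol=1e-05, atol=1e-08).
mask = np.isclose(df["Close"], 2.389)

# Series.mask sets entries to NaN wherever the mask is True.
cleaned = df["Close"].mask(mask)

print(exact.isna().sum())    # number of NaNs after exact replace
print(cleaned.isna().sum())  # number of NaNs after tolerance-based masking
```

If the default tolerances are too loose or too tight for your data, np.isclose accepts explicit rtol and atol arguments.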

3 Comments

harbun Over a year ago

It worked! I am a bit confused though, in the source csv it's exactly 2.389 - why did the value change after loading it into the dataframe? All I did was import like this: df = read_csv("data.csv", parse_dates=[0], infer_datetime_format=True, index_col=0)

unutbu Over a year ago

The floating point representation of decimals is not exact. So when the CSV string "2.389" is parsed into a float, the resulting float is not exactly 2.389; it is instead the number closest to 2.389 which is representable as a float. To see the exact value stored in a Python float, use Decimal. For example, after import decimal, decimal.Decimal(2.389) yields Decimal('2.388999999999999790389892950770445168018341064453125')

unutbu Over a year ago

Also note that the DataFrame is storing the float in a NumPy array of dtype float32 or float64 -- a 32-bit or 64-bit float. On my (64-bit) machine, pd.DataFrame([[2.389]]).iloc[0,0] returns 2.3889999999999998.
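
The two comments above can be checked directly. A minimal sketch, assuming NumPy is installed; the variable names are illustrative only:

```python
from decimal import Decimal

import numpy as np

# Decimal(float) shows the exact binary value stored for the literal 2.389.
exact64 = Decimal(2.389)

# The same literal stored as a 32-bit float is a different number again.
exact32 = Decimal(float(np.float32(2.389)))

print(exact64)
print(exact32)

# Neither stored value equals the decimal number 2.389 exactly.
print(exact64 == Decimal("2.389"))
print(exact32 == Decimal("2.389"))

# The float32 and float64 representations also differ from each other.
print(np.float32(2.389) == np.float64(2.389))
```

So whether an exact comparison against 2.389 succeeds can depend on the dtype of the column as well as on how the value was parsed.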

EdChum · Answer · 2015-01-16 19:56:12Z

3

You need to assign the result to df['Close'] or pass the parameter inplace=True: df['Close'].replace(2.389, np.nan, inplace=True)

e.g.:

In [5]:

df['Close'] = df['Close'].replace(2.389, np.nan)
df['Close']
Out[5]:
0      2.389
1    157.140
Name: Close, dtype: float64

Most pandas operations return a copy and some accept the param inplace.
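
To illustrate the copy behaviour, here is a small sketch with made-up data. The value 2.5 is chosen because it is exactly representable as a float, so exact matching is not the issue here:

```python
import numpy as np
import pandas as pd

# Made-up data; 2.5 is exactly representable, so replace can match it.
original = pd.Series([2.5, 157.14], name="Close")

# replace returns a new Series; the original is left untouched.
replaced = original.replace(2.5, np.nan)

print(original.isna().sum())  # NaN count in the original
print(replaced.isna().sum())  # NaN count in the returned copy

# To keep the change, assign the result back to a name (or a column).
original = replaced
```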

Check the docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.replace.html#pandas.Series.replace

2 Comments

harbun Over a year ago

Unfortunately in this case it did not work, but if replace would actually find the value, this would be the way to go.

EdChum Over a year ago

I think unutbu's answer is the obvious correct answer, strangely for me it worked fine without having to do anything special
