
Using:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

a = pd.read_csv('file.csv', na_values=['-9999.0'], decimal=',')
a.index = pd.to_datetime(a[['Year', 'Month', 'Day', 'Hour', 'Minute']])
pd.options.mode.chained_assignment = None

The dataframe is something like:

Index               A    B       C      D
2016-07-20 18:00:00 9   4.0     NaN    2
2016-07-20 19:00:00 9   2.64    0.0    3
2016-07-20 20:00:00 12  2.59    0.0    1
2016-07-20 21:00:00 9   4.0     NaN    2
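To try the answers without file.csv, the sample rows can be rebuilt by hand. This is only a sketch: the values are copied from the table above, and the variable name a follows the question.

```python
import numpy as np
import pandas as pd

# Rebuild the four sample rows shown above (no CSV file needed).
idx = pd.to_datetime([
    "2016-07-20 18:00:00",
    "2016-07-20 19:00:00",
    "2016-07-20 20:00:00",
    "2016-07-20 21:00:00",
])
a = pd.DataFrame(
    {
        "A": [9, 9, 12, 9],
        "B": [4.0, 2.64, 2.59, 4.0],
        "C": [np.nan, 0.0, 0.0, np.nan],
        "D": [2, 3, 1, 2],
    },
    index=idx,
)
a.index.name = "Index"
print(a)
```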

The goal is to set the entire row to np.nan when the value in column A is 9 and the value in column D is 2 at the same time. For example:

Expected output:

Index               A    B       C      D
2016-07-20 18:00:00 NaN NaN     NaN    NaN
2016-07-20 19:00:00 9   2.64    0.0    3
2016-07-20 20:00:00 12  2.59    0.0    1
2016-07-20 21:00:00 NaN NaN     NaN    NaN

I would be thankful if someone could help.

4 Answers

Option 1
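This is the inverse of @Jezrael's mask solution: where keeps values where the condition is True and sets the rest to NaN, so the condition is negated.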

a.where(a.A.ne(9) | a.D.ne(2))

                        A     B    C    D
Index                                    
2016-07-20 18:00:00   NaN   NaN  NaN  NaN
2016-07-20 19:00:00   9.0  2.64  0.0  3.0
2016-07-20 20:00:00  12.0  2.59  0.0  1.0
2016-07-20 21:00:00   NaN   NaN  NaN  NaN
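The negated condition follows from De Morgan's law: not (A == 9 and D == 2) is the same as (A != 9 or D != 2). A small sketch, using a hand-built copy of the sample data (the names drop, keep, via_where and via_mask are made up for illustration), checks that where with the negated condition matches mask with the original one:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({"A": [9, 9, 12, 9], "B": [4.0, 2.64, 2.59, 4.0],
                  "C": [np.nan, 0.0, 0.0, np.nan], "D": [2, 3, 1, 2]})

drop = a.A.eq(9) & a.D.eq(2)   # rows to blank out
keep = a.A.ne(9) | a.D.ne(2)   # De Morgan: the logical negation of drop

via_where = a.where(keep)      # keeps values where keep is True
via_mask = a.mask(drop)        # blanks values where drop is True
print(via_where.equals(via_mask))
```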

Option 2
Filter with the same condition, then use pd.DataFrame.reindex to restore the dropped rows as NaN.

a[a.A.ne(9) | a.D.ne(2)].reindex(a.index)

                        A     B    C    D
Index                                    
2016-07-20 18:00:00   NaN   NaN  NaN  NaN
2016-07-20 19:00:00   9.0  2.64  0.0  3.0
2016-07-20 20:00:00  12.0  2.59  0.0  1.0
2016-07-20 21:00:00   NaN   NaN  NaN  NaN
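The two steps of Option 2 can be looked at separately. The sketch below uses a hand-built copy of the sample data; kept and restored are illustrative names. As far as I know, reindex expects unique index labels, so this approach is probably best kept for frames without duplicate timestamps.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(["2016-07-20 18:00:00", "2016-07-20 19:00:00",
                      "2016-07-20 20:00:00", "2016-07-20 21:00:00"])
a = pd.DataFrame({"A": [9, 9, 12, 9], "B": [4.0, 2.64, 2.59, 4.0],
                  "C": [np.nan, 0.0, 0.0, np.nan], "D": [2, 3, 1, 2]},
                 index=idx)

kept = a[a.A.ne(9) | a.D.ne(2)]    # step 1: only the surviving rows remain
restored = kept.reindex(a.index)   # step 2: missing labels come back as NaN rows
print(len(kept), len(restored))
```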

Try this:

df.loc[df.A.eq(9) & df.D.eq(2)] = [np.nan] * len(df.columns)

Demo:

In [158]: df
Out[158]:
                      A     B    C  D
Index
2016-07-20 18:00:00   9  4.00  NaN  2
2016-07-20 19:00:00   9  2.64  0.0  3
2016-07-20 20:00:00  12  2.59  0.0  1
2016-07-20 21:00:00   9  4.00  NaN  2

In [159]: df.loc[df.A.eq(9) & df.D.eq(2)] = [np.nan] * len(df.columns)

In [160]: df
Out[160]:
                        A     B    C    D
Index
2016-07-20 18:00:00   NaN   NaN  NaN  NaN
2016-07-20 19:00:00   9.0  2.64  0.0  3.0
2016-07-20 20:00:00  12.0  2.59  0.0  1.0
2016-07-20 21:00:00   NaN   NaN  NaN  NaN
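A scalar np.nan broadcasts over the selected rows, so the list on the right-hand side is not strictly needed. The sketch below assumes an all-numeric frame and casts it to float first, since newer pandas versions may warn when NaN is written into integer columns in place; rows is an illustrative name.

```python
import numpy as np
import pandas as pd

# Cast up front (assumption: every column is numeric).
df = pd.DataFrame({"A": [9, 9, 12, 9], "B": [4.0, 2.64, 2.59, 4.0],
                   "C": [np.nan, 0.0, 0.0, np.nan], "D": [2, 3, 1, 2]}).astype(float)

rows = df.A.eq(9) & df.D.eq(2)   # boolean row selector
df.loc[rows] = np.nan            # the scalar fills every column of the matched rows
print(df)
```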

Alternatively, we can use the DataFrame.where() method:

In [174]: df = df.where(~(df.A.eq(9) & df.D.eq(2)))

In [175]: df
Out[175]:
                        A     B    C    D
Index
2016-07-20 18:00:00   NaN   NaN  NaN  NaN
2016-07-20 19:00:00   9.0  2.64  0.0  3.0
2016-07-20 20:00:00  12.0  2.59  0.0  1.0
2016-07-20 21:00:00   NaN   NaN  NaN  NaN

9 Comments

I get "ValueError: cannot set using a list-like indexer with a different length than the value" for the first solution :(
@jezrael, can you provide a sample data set to reproduce this error?
@jezrael, I can't reproduce it
@jezrael, pandas: 0.20.1
Hmm, OK. After you change the answer, I can add your solution to the timings. Thanks.

Use mask, which replaces values with NaN by default wherever the condition is True:

df = a.mask((a['A'] == 9) & (a['D'] == 2))
print(df)
                        A     B    C    D
Index                                    
2016-07-20 18:00:00   NaN   NaN  NaN  NaN
2016-07-20 19:00:00   9.0  2.64  0.0  3.0
2016-07-20 20:00:00  12.0  2.59  0.0  1.0
2016-07-20 21:00:00   NaN   NaN  NaN  NaN

Or use boolean indexing and assign NaN in place:

a[(a['A'] == 9) & (a['D'] == 2)] = np.nan
print(a)
                        A     B    C    D
Index                                    
2016-07-20 18:00:00   NaN   NaN  NaN  NaN
2016-07-20 19:00:00   9.0  2.64  0.0  3.0
2016-07-20 20:00:00  12.0  2.59  0.0  1.0
2016-07-20 21:00:00   NaN   NaN  NaN  NaN
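One practical difference between the two variants: mask returns a new frame and leaves the original alone, while the boolean-indexing assignment modifies a in place. In both cases the integer columns end up as float, because NaN has no integer representation in NumPy-backed columns. A small sketch on a hand-built copy of the sample data (cond and masked are illustrative names):

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({"A": [9, 9, 12, 9], "B": [4.0, 2.64, 2.59, 4.0],
                  "C": [np.nan, 0.0, 0.0, np.nan], "D": [2, 3, 1, 2]})
cond = (a["A"] == 9) & (a["D"] == 2)

masked = a.mask(cond)           # new frame; a itself is not modified
print(a["A"].dtype.kind)        # dtype kind of the untouched original column
print(masked["A"].dtype.kind)   # dtype kind after NaN was introduced
```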

Timings:

np.random.seed(123)
N = 1000000
L = list('abcdefghijklmnopqrst'.upper())

a = pd.DataFrame(np.random.choice([np.nan,2,9], size=(N,20)), columns=L) 

#jez2
In [256]: %timeit a[(a['A'] == 9) & (a['D'] == 2)] = np.nan
10 loops, best of 3: 25.8 ms per loop

#jez2upr
In [257]: %timeit a.loc[(a['A'] == 9) & (a['D'] == 2)] = np.nan
10 loops, best of 3: 27.6 ms per loop

#Wen
In [258]: %timeit a.mul(np.where((a.A==9)&(a.D==2),np.nan,1),0)
10 loops, best of 3: 90.5 ms per loop

#jez1
In [259]: %timeit a.mask((a['A'] == 9) & (a['D'] == 2))
1 loop, best of 3: 316 ms per loop

#maxu2
In [260]: %timeit a.where(~(a.A.eq(9) & a.D.eq(2)))
1 loop, best of 3: 318 ms per loop

#pir1
In [261]: %timeit a.where(a.A.ne(9) | a.D.ne(2))
1 loop, best of 3: 316 ms per loop

#pir2
In [263]: %timeit a[a.A.ne(9) | a.D.ne(2)].reindex(a.index)
1 loop, best of 3: 355 ms per loop
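Outside IPython, a rough comparison can be run with the standard timeit module. This is a sketch, not a rerun of the numbers above: N is reduced to keep it quick, the function names are made up, and the in-place variant works on a fresh copy each call (so its time includes the copy) to avoid timing a frame that was already modified by an earlier run.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(123)
N = 10_000  # much smaller than the 1,000,000 rows used above
L = list("ABCDEFGHIJKLMNOPQRST")
base = pd.DataFrame(np.random.choice([np.nan, 2, 9], size=(N, 20)), columns=L)

def assign_inplace():
    a = base.copy()  # fresh copy so every call starts from the same data
    a[(a["A"] == 9) & (a["D"] == 2)] = np.nan
    return a

def use_mask():
    return base.mask((base["A"] == 9) & (base["D"] == 2))

for fn in (assign_inplace, use_mask):
    seconds = timeit.timeit(fn, number=5) / 5
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```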


Or you can try using .mul after np.where:

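# 1 for rows to keep, NaN for rows to blank out
m = np.where((df2.A==9)&(df2.D==2), np.nan, 1)
# axis=0 aligns the multiplier with the rows rather than the columns
df2.mul(m, axis=0)
# one line: df.mul(np.where((df.A==9)&(df.D==2), np.nan, 1), axis=0)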

                        A     B    C    D
Index                                    
2016-07-20 18:00:00   NaN   NaN  NaN  NaN
2016-07-20 19:00:00   9.0  2.64  0.0  3.0
2016-07-20 20:00:00  12.0  2.59  0.0  1.0
2016-07-20 21:00:00   NaN   NaN  NaN  NaN
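The multiplication trick relies on NaN * x being NaN and 1 * x being x, so it only makes sense for numeric columns. A minimal sketch on a hand-built copy of the sample data (factor and out are illustrative names), with the row axis passed explicitly so the multiplier lines up with rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [9, 9, 12, 9], "B": [4.0, 2.64, 2.59, 4.0],
                   "C": [np.nan, 0.0, 0.0, np.nan], "D": [2, 3, 1, 2]})

# One multiplier per row: NaN where both conditions hold, 1 elsewhere.
factor = np.where((df.A == 9) & (df.D == 2), np.nan, 1)

# axis=0 applies factor[i] to every column of row i.
out = df.mul(factor, axis=0)
print(out)
```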

3 Comments

This is clever (-:
Yes, indeed, it's a smart option! We can make a one-liner out of it: df.mul(np.where((df.A==9)&(df.D==2),np.nan,1), axis=0)
@MaxU thank you. Yes, you are right, one line is neater.
