
I have a large DataFrame object where missing values are pre-coded as 0.001. These missing values only occur at the beginning of the DataFrame. For example:

df = pd.DataFrame({'a':[0.001, 0.001, 0.001, 0.50, 0.10, 0.001, 0.75]})

The problem is that sometimes there are actual 0.001 values that are not at the beginning of the DataFrame and that I don't want to drop (like in the example above).

What I want is:

df = pd.DataFrame({'a': [np.nan, np.nan, np.nan, 0.50, 0.10, 0.001, 0.75]})

But I can't figure out a simple way to drop only the 0.001 values at the beginning of the DataFrame while ignoring the ones that occur later on.

The dataset I'm working with is massive, so I was hoping to avoid looping through each column and each index (which is what I'm currently doing, but it takes a bit too long).
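For reference, the loop-based approach I'm currently using looks roughly like this (a sketch; the real DataFrame has many more columns):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0.001, 0.001, 0.001, 0.50, 0.10, 0.001, 0.75]})

# Walk each column from the top and replace 0.001 with NaN
# until the first non-0.001 value is reached, then stop.
for col in df.columns:
    for i in df.index:
        if df.at[i, col] == 0.001:
            df.at[i, col] = np.nan
        else:
            break
```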

Any ideas?


1 Answer


Here's an approach:

df.mask(df[df != 0.001].ffill().isnull(), np.nan)
Out: 
       a
0    NaN
1    NaN
2    NaN
3  0.500
4  0.100
5  0.001
6  0.750

This first selects the cells where the DataFrame does not equal 0.001; in that selection, every 0.001 cell becomes NaN. If you then forward fill, real values propagate downward, so only the NaNs at the very top of each column, which have nothing above them to copy, remain NaN. Using that null pattern as a mask on the original DataFrame replaces exactly the leading 0.001 values and leaves the later ones intact.
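Spelled out step by step, the approach looks like this (a minimal, self-contained sketch using the question's example):

```python
import pandas as pd

df = pd.DataFrame({'a': [0.001, 0.001, 0.001, 0.50, 0.10, 0.001, 0.75]})

# Step 1: indexing with df != 0.001 turns every 0.001 cell into NaN.
selected = df[df != 0.001]

# Step 2: forward-fill. Interior NaNs pick up the value above them,
# but the NaNs at the top of each column have nothing to copy and stay NaN.
filled = selected.ffill()

# Step 3: mask the original DataFrame wherever the filled frame is still NaN.
# mask() inserts NaN by default where the condition is True.
result = df.mask(filled.isnull())
```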


