
I have a fairly large dataframe of 320k rows and 450 columns. I also have several lists of column numbers:

list1 = [1,3,5,...]
list2 = [4,9,...]
...

My goal is to replace certain values in the columns from each list and then save the result:

df[df[list1] > 7] = np.nan
df[df[list2] >90] = np.nan
...

The size of the dataframe forced me to do it in chunks:

for chunk in pd.read_csv(filePrev, chunksize=10000, header=None):
    chunk[chunk[list1] >= 7] = np.nan
    chunk[chunk[list2] >= 90] = np.nan
    ...
    chunk.to_csv(newFile, mode='a', header=False, index=False)

But it didn't work correctly: I've run this code many times, and it mostly couldn't finish (Memory Errors in IPython, or the application was simply closed by Windows), no matter which chunksize value I used. When it did finish, all values had been replaced with NaN in most of the rows, and only in some rows were the replacements done correctly.


I've tried the same logic on a small piece of the same dataset, and there it works correctly!

In [11]: df = pd.read_csv(filePrev,nrows=5,usecols=[1,2,3,4,5,6,7],header=None)

In [12]: df
Out[12]:
   1  2  3  4  5  6  7
0  1  1  1  1  1  1  1
1  3  1  1  1  2  1  1
2  3  1  1  1  1  1  1
3  3  1  1  1  2  1  2
4  3  1  1  1  1  1  1

In [13]: list = [1,7]

In [14]: df[df[list] > 1] = np.nan

In [15]: df
Out[15]:
    1  2  3  4  5  6   7
0   1  1  1  1  1  1   1
1 NaN  1  1  1  2  1   1
2 NaN  1  1  1  1  1   1
3 NaN  1  1  1  2  1 NaN
4 NaN  1  1  1  1  1   1

So, any ideas about this? Can it be implemented in "chunking" mode, or is there another way (if so, I need an example)? I just want to replace certain values with NaN... :)


2 Answers


It's possible this could be improved by keeping the file open, rather than reopening it in append mode on every iteration:

with open(newFile, 'a') as f:
    for chunk in pd.read_csv(filePrev,chunksize=10000,header=None):
        chunk[chunk[list1] >= 7] = np.nan
        chunk[chunk[list2] >= 90] = np.nan
        chunk.to_csv(f, header=False, index=False)

Someone reported this behaviour here recently, and this change gave them a 98.3% performance gain on Windows (I only saw around 25% on OS X).


If you run your Python code under cProfile or (IPython's) %prun, you can see which calls take the most time and which functions are called most often. In the case of the question I was referring to above, most of the time was spent in Python's close function (the output file is closed after each call of to_csv unless you keep it open).
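As a minimal sketch of this kind of profiling outside IPython (the work function here is a hypothetical stand-in for the chunked CSV loop):

```python
import cProfile
import io
import pstats

def work():
    # Hypothetical stand-in for the chunked-CSV loop being profiled.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# Sort by cumulative time, as %prun does, and show the top entries;
# an expensive `close` showing up here would point at repeated file opens.
stats = pstats.Stats(profiler, stream=io.StringIO())
stats.sort_stats("cumulative").print_stats(5)
```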


Note: the logic looks OK — you're not assigning to a copy — and, as your smaller example shows, the code works!




The problem was in the code processing certain columns. It contained lines like:

chunk[chunk[393] > 50] = np.nan

Instead of

chunk[chunk[[393]] > 50] = np.nan

If there was any row N for which

chunk[393][N] > 50

then that entire row was turned into NaN: with the scalar label, chunk[393] > 50 is a boolean Series, and boolean-Series indexing selects (and here assigns NaN to) whole rows, while the double-bracket form produces a boolean DataFrame mask that only touches the matching cells.
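To make the difference concrete, here is a minimal sketch, with a toy frame standing in for one chunk and column 0 playing the role of column 393:

```python
import numpy as np
import pandas as pd

# Toy chunk: column 0 plays the role of column 393.
chunk = pd.DataFrame({0: [1, 60, 2], 1: [5, 5, 5]})

# Scalar label: chunk[0] > 50 is a boolean Series, so the assignment
# hits entire rows -- row 1 becomes NaN in EVERY column.
wrong = chunk.copy()
wrong[wrong[0] > 50] = np.nan

# List of labels: chunk[[0]] > 50 is a boolean DataFrame, so the
# assignment hits only the matching cells -- column 1 stays untouched.
right = chunk.copy()
right[right[[0]] > 50] = np.nan
```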


Thanks to all for your help.

