
I have a fairly large dataframe of 320k rows and 450 columns. I also have several lists of column numbers:

list1 = [1,3,5,...]
list2 = [4,9,...]
...

My goal is to replace certain values in the columns from each list and then save the result:

df[df[list1] > 7] = np.nan
df[df[list2] >90] = np.nan
...

The size of the dataframe forced me to do it in chunks:

for chunk in pd.read_csv(filePrev, chunksize=10000, header=None):
    chunk[chunk[list1] >= 7] = np.nan
    chunk[chunk[list2] >= 90] = np.nan
    ...
    chunk.to_csv(newFile, mode='a', header=False, index=False)

But it didn't work correctly: I've run this code many times, and it mostly couldn't finish (Memory Errors in IPython, or the application was simply closed by Windows), no matter which chunksize value I used. When it did finish, all values had been replaced with NaN in most of the rows, and only in some rows were the replacements done correctly.


I've tried the same logic on a small piece of the same dataset, and there it works correctly!

In [11]: df = pd.read_csv(filePrev,nrows=5,usecols=[1,2,3,4,5,6,7],header=None)

In [12]: df
Out[12]:
   1  2  3  4  5  6  7
0  1  1  1  1  1  1  1
1  3  1  1  1  2  1  1
2  3  1  1  1  1  1  1
3  3  1  1  1  2  1  2
4  3  1  1  1  1  1  1

In [13]: list = [1,7]

In [14]: df[df[list] > 1] = np.nan

In [15]: df
Out[15]:
    1  2  3  4  5  6   7
0   1  1  1  1  1  1   1
1 NaN  1  1  1  2  1   1
2 NaN  1  1  1  1  1   1
3 NaN  1  1  1  2  1 NaN
4 NaN  1  1  1  1  1   1

So, any ideas about this? Can it be implemented in "chunking" mode, or is there another way (if so, I need an example)? I just want to replace certain values with NaN... :)


2 Answers


It's possible this could be improved by keeping the file open, rather than reopening it in append mode on every iteration:

with open(newFile, 'a') as f:
    for chunk in pd.read_csv(filePrev,chunksize=10000,header=None):
        chunk[chunk[list1] >= 7] = np.nan
        chunk[chunk[list2] >= 90] = np.nan
        chunk.to_csv(f, header=False, index=False)

Someone reported this behaviour here recently, and this change gave them a 98.3% performance gain on Windows (I only saw around 25% on OS X).


If you run your Python code under cProfile or (IPython's) %prun, you can see which calls take the most time and which functions are called most often. In the case of the question I was referring to above, most of the time was spent in Python's close function (the output file is closed after each call of to_csv unless you keep it open).
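As a minimal sketch of this kind of profiling outside IPython (the work function here is a hypothetical stand-in for the chunked CSV loop):

```python
import cProfile
import io
import pstats

def work():
    # Hypothetical stand-in for the chunked-CSV loop being profiled.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# Sort by cumulative time, as %prun does, and show the top entries;
# an expensive `close` showing up here would point at repeated file opens.
stats = pstats.Stats(profiler, stream=io.StringIO())
stats.sort_stats("cumulative").print_stats(5)
```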


Note: the logic looks OK — you're not assigning to a copy — and, as your smaller example shows, the code works!




The problem was in the code processing certain columns. It contained lines like:

chunk[chunk[393] > 50] = np.nan

Instead of

chunk[chunk[[393]] > 50] = np.nan

If there was any row N for which

chunk[393][N] > 50

then that entire row was turned into NaN: with the scalar label, chunk[393] > 50 is a boolean Series, and boolean-Series indexing selects (and here assigns NaN to) whole rows, while the double-bracket form produces a boolean DataFrame mask that only touches the matching cells.
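To make the difference concrete, here is a minimal sketch, with a toy frame standing in for one chunk and column 0 playing the role of column 393:

```python
import numpy as np
import pandas as pd

# Toy chunk: column 0 plays the role of column 393.
chunk = pd.DataFrame({0: [1, 60, 2], 1: [5, 5, 5]})

# Scalar label: chunk[0] > 50 is a boolean Series, so the assignment
# hits entire rows -- row 1 becomes NaN in EVERY column.
wrong = chunk.copy()
wrong[wrong[0] > 50] = np.nan

# List of labels: chunk[[0]] > 50 is a boolean DataFrame, so the
# assignment hits only the matching cells -- column 1 stays untouched.
right = chunk.copy()
right[right[[0]] > 50] = np.nan
```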


Thanks to all for your help.

