
I've got a large DataFrame named data with 600k rows and 2 columns; the second column holds values drawn from a set of 50k unique terms spread across the data.

The data looks like this:

    image_id     term 
0   56127        23001  
1   56127        763003  
2   56127        51002  
3   26947        581007  
4   26947        14001  
5   26947        95000  
6   26947        92000  
7   26947        62004  
8   26947        224007
...600k more

On the other hand, I have a Series named terms_indexed whose index consists of these 50k terms, like this:

            NewTerm
Term                  
23001          9100
763003          402
51002         10608
581007          900
14001         42107
95000           900
92000          4002
62004         42107
224007         9100
...50k more

But I want to replace those values in the original DataFrame efficiently using the Series of indexed terms. So far I have done it with the following loop:

    # row-by-row replacement (note: .ix is deprecated; .loc does the label lookup)
    for i in range(data.shape[0]):
        data.loc[i, 'term'] = int(terms_indexed.loc[data.iloc[i, 1]])

However, this replacement takes a long time: about 35 minutes on an Intel Core i7 with 8 GB of RAM. I wanted to know if there is a better way to do this operation. Thanks in advance.

  • If you set the index to the term column on your large df, then you could just call update, like so: large_df.update(other_df). Commented Sep 3, 2014 at 21:09
  • This might be a use case for the shiny categorical dtype (for the term column). Commented Sep 3, 2014 at 21:53
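The update suggestion from the first comment would look roughly like this. This is a sketch, not the asker's code: data and terms_indexed are reconstructed here with a few rows, and the temporary set_index / reset_index dance is one way (an assumption, not the only way) to get the index alignment that update needs.

```python
import pandas as pd

# Small stand-ins for the question's 600k-row frame and 50k-term Series
data = pd.DataFrame({'image_id': [56127, 26947, 26947],
                     'term':     [23001, 23001, 763003]})
terms_indexed = pd.Series({23001: 9100, 763003: 402, 581007: 900})

# update aligns on the index, so temporarily index the frame by term,
# re-create a term column to be overwritten, then update from the Series
tmp = data.set_index('term')
tmp['term'] = tmp.index
tmp.update(terms_indexed.rename('term'))  # a named Series is coerced to a one-column frame
result = tmp.reset_index(drop=True)
print(result)
```

Duplicate terms in data are fine here, because the alignment reindexes from the unique index of terms_indexed onto the (possibly repeated) term labels.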

1 Answer

If I understand your situation right, you can just do df['term'] = df['term'].map(terms_indexed). series1.map(series2) "translates" series1 by using its values as labels to look up in series2's index.
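Applied to the question's data (reconstructed here with a handful of rows), the one-liner looks like this:

```python
import pandas as pd

data = pd.DataFrame({'image_id': [56127, 56127, 56127, 26947],
                     'term':     [23001, 763003, 51002, 581007]})
# terms_indexed: index = old term, values = NewTerm
terms_indexed = pd.Series([9100, 402, 10608, 900],
                          index=[23001, 763003, 51002, 581007])

# each value in data['term'] is looked up in terms_indexed's index
data['term'] = data['term'].map(terms_indexed)
print(data)
```

Any term missing from terms_indexed would come back as NaN, so this also doubles as a quick check that the lookup table is complete.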


7 Comments

I just compared update against map and was a little surprised that map outperformed update: for a 90k-row dataframe, update takes 19.4 ms vs 8.61 ms for map.
The reason I say this is that Jeff has always commented to me that map and apply are last-resort methods, so I thought update would perform better.
@EdChum: I think map with a function argument can be slow. map with a series argument should be pretty fast. But I'm not up on pandas internals so I don't know for sure.
I think I was told in a comment that map is a cython-optimised for loop; not sure if that changes when the argument is a Series. Anyway, this was the comment from Jeff:
@EdChum if arg is a Series/dict (and has a unique index), then this is a completely vectorized operation (so very fast); it's even better than cython, just a take really.
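A quick way to check the "map with a Series argument is vectorized" claim at the question's actual scale is to time it on synthetic data. This is a sketch; the sizes match the question, but the random values are made up.

```python
import time
import numpy as np
import pandas as pd

n_rows, n_terms = 600_000, 50_000
rng = np.random.default_rng(0)
terms = np.arange(n_terms)

# synthetic stand-ins with the question's shapes
data = pd.DataFrame({'image_id': rng.integers(0, 100_000, n_rows),
                     'term': rng.choice(terms, n_rows)})
terms_indexed = pd.Series(rng.integers(0, 100_000, n_terms), index=terms)

start = time.perf_counter()
mapped = data['term'].map(terms_indexed)
elapsed = time.perf_counter() - start
print(f"map over {n_rows} rows took {elapsed:.4f}s")
```

On any recent machine this runs in a small fraction of a second, versus the 35 minutes reported for the row-by-row loop.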
