
I've got a large DataFrame named data with 600k rows and 2 columns; the second column holds values drawn from a set of 50k unique terms spread across the data.

The data looks like this:

    image_id     term 
0   56127        23001  
1   56127        763003  
2   56127        51002  
3   26947        581007  
4   26947        14001  
5   26947        95000  
6   26947        92000  
7   26947        62004  
8   26947        224007
...600k more

On the other hand, I have a Series named terms_indexed whose index consists of these 50k terms, like this:

            NewTerm
Term                  
23001          9100
763003          402
51002         10608
581007          900
14001         42107
95000           900
92000          4002
62004         42107
224007         9100
...50k more

But I want to replace those values in the original DataFrame efficiently using the Series of indexed terms. So far I have done it with the following loop:

    # row-by-row replacement (note: .ix is deprecated; .loc does the label lookup)
    for i in range(data.shape[0]):
        data.loc[i, 'term'] = int(terms_indexed.loc[data.iloc[i, 1]])

However, this replacement takes a long time: about 35 minutes on an Intel Core i7 with 8 GB of RAM. I wanted to know if there is a better way to do this operation. Thanks in advance.

  • If you set the index to the term column on your large df, then you could just call update, like so: large_df.update(other_df). Commented Sep 3, 2014 at 21:09
  • This might be a use case for the shiny categorical dtype (for the term column). Commented Sep 3, 2014 at 21:53
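The update suggestion from the first comment would look roughly like this. This is a sketch, not the asker's code: data and terms_indexed are reconstructed here with a few rows, and the temporary set_index / reset_index dance is one way (an assumption, not the only way) to get the index alignment that update needs.

```python
import pandas as pd

# Small stand-ins for the question's 600k-row frame and 50k-term Series
data = pd.DataFrame({'image_id': [56127, 26947, 26947],
                     'term':     [23001, 23001, 763003]})
terms_indexed = pd.Series({23001: 9100, 763003: 402, 581007: 900})

# update aligns on the index, so temporarily index the frame by term,
# re-create a term column to be overwritten, then update from the Series
tmp = data.set_index('term')
tmp['term'] = tmp.index
tmp.update(terms_indexed.rename('term'))  # a named Series is coerced to a one-column frame
result = tmp.reset_index(drop=True)
print(result)
```

Duplicate terms in data are fine here, because the alignment reindexes from the unique index of terms_indexed onto the (possibly repeated) term labels.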

1 Answer

If I understand your situation right, you can just do df['term'] = df['term'].map(terms_indexed). series1.map(series2) "translates" series1 by using its values as labels to look up in series2's index.
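Applied to the question's data (reconstructed here with a handful of rows), the one-liner looks like this:

```python
import pandas as pd

data = pd.DataFrame({'image_id': [56127, 56127, 56127, 26947],
                     'term':     [23001, 763003, 51002, 581007]})
# terms_indexed: index = old term, values = NewTerm
terms_indexed = pd.Series([9100, 402, 10608, 900],
                          index=[23001, 763003, 51002, 581007])

# each value in data['term'] is looked up in terms_indexed's index
data['term'] = data['term'].map(terms_indexed)
print(data)
```

Any term missing from terms_indexed would come back as NaN, so this also doubles as a quick check that the lookup table is complete.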


7 Comments

I just compared update against map and was a little surprised that map outperformed update: for a 90k-row dataframe, update takes 19.4 ms vs 8.61 ms for map.
The reason I say this is that Jeff has always commented to me that map and apply are last-resort methods, so I thought update would perform better.
@EdChum: I think map with a function argument can be slow. map with a series argument should be pretty fast. But I'm not up on pandas internals so I don't know for sure.
I think I was told in a comment that map is a cython-optimised for loop; not sure if that changes when the argument is a Series. Anyway, this was the comment from Jeff:
@EdChum if arg is a Series/dict (and has a unique index), then this is a completely vectorized operation (so very fast); it's even better than cython, just a take really.
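A quick way to check the "map with a Series argument is vectorized" claim at the question's actual scale is to time it on synthetic data. This is a sketch; the sizes match the question, but the random values are made up.

```python
import time
import numpy as np
import pandas as pd

n_rows, n_terms = 600_000, 50_000
rng = np.random.default_rng(0)
terms = np.arange(n_terms)

# synthetic stand-ins with the question's shapes
data = pd.DataFrame({'image_id': rng.integers(0, 100_000, n_rows),
                     'term': rng.choice(terms, n_rows)})
terms_indexed = pd.Series(rng.integers(0, 100_000, n_terms), index=terms)

start = time.perf_counter()
mapped = data['term'].map(terms_indexed)
elapsed = time.perf_counter() - start
print(f"map over {n_rows} rows took {elapsed:.4f}s")
```

On any recent machine this runs in a small fraction of a second, versus the 35 minutes reported for the row-by-row loop.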
