I'm trying to replace some string values in the index of a pandas data frame. The index labels are country names, and I want to replace strings like 'United Kingdom of England and Northern Ireland' with 'UK'.

The data frame looks like this:

import pandas as pd

data = ['12', '13', '14', '15']
df = pd.DataFrame(data,
                  index=['Republic of Korea', 'United States of America20',
                         'United Kingdom of Great Britain and Northern Ireland19',
                         'China, Hong Kong Special Administrative Region'],
                  columns=['Country'])

I have tried:

d = {"Republic of Korea": "South Korea",
     "United States of America20": "United States",
     "United Kingdom of Great Britain and Northern Ireland19": "United Kingdom",
     "China, Hong Kong Special Administrative Region": "Hong Kong"}
df.index = df.index.str.replace(d)

Unfortunately, I just get an error message that replace is missing a positional argument.

2 Answers

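str.replace expects a pattern and a replacement rather than a dict, which is why your call fails. In pandas, the rename function is used to replace values in the index or columns: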

df = df.rename(d)
print (df)
               Country
South Korea         12
United States       13
United Kingdom      14
Hong Kong           15
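
As a small self-contained sketch of the same call: passing the dict through the index keyword makes the target axis explicit, and labels that are missing from the dict are left unchanged. The partial dict below is an assumption made for illustration, not the mapping from the question.

```python
import pandas as pd

df = pd.DataFrame(
    ['12', '13', '14', '15'],
    index=['Republic of Korea', 'United States of America20',
           'United Kingdom of Great Britain and Northern Ireland19',
           'China, Hong Kong Special Administrative Region'],
    columns=['Country'],
)

# Illustrative partial mapping: only two of the four labels are covered.
partial = {"Republic of Korea": "South Korea",
           "United States of America20": "United States"}

# rename returns a new frame; df itself is not modified.
renamed = df.rename(index=partial)
print(renamed.index.tolist())
```

Because rename returns a new object, the result has to be assigned back (as in df = df.rename(d) above) for the change to stick.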

For me, the timings are practically the same:

df = pd.concat([df] * 100000)

In [11]: %timeit df.rename(d)
10 loops, best of 3: 75.7 ms per loop

In [12]: %timeit pd.Series(df.index).replace(d)
10 loops, best of 3: 71.8 ms per loop

In [13]: %timeit pd.Series(df.index.values).replace(d)
10 loops, best of 3: 75.3 ms per loop

2 Comments

Can you please add pd.Series(df.index.values).replace(d) to your timeit list as well?
Sure, no problem. Done.

You could initialise a series and call pd.Series.replace:

df   
                                                   Country
Republic of Korea                                       12
United States of America20                              13
United Kingdom of Great Britain and Northern Ir...      14
China, Hong Kong Special Administrative Region          15


df.index = pd.Series(df.index).replace(d)
df

               Country
South Korea         12
United States       13
United Kingdom      14
Hong Kong           15
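
One thing worth keeping in mind with this approach: with a plain dict, Series.replace matches whole values, so a label that only starts with a key is not touched. A minimal sketch, where the second label is a hypothetical one made up for the example:

```python
import pandas as pd

d = {"Republic of Korea": "South Korea"}

# "Republic of Korea21" is a hypothetical label, used only to show
# that a partial match is left alone.
labels = pd.Index(["Republic of Korea", "Republic of Korea21"])

replaced = pd.Series(labels).replace(d)
print(replaced.tolist())
```

So the keys of d need to match the index labels exactly, including any trailing digits.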

Timings

df = pd.concat([df] * 100000)

%timeit df.rename(d)
10 loops, best of 3: 116 ms per loop

%timeit pd.Series(df.index).replace(d)
10 loops, best of 3: 96.7 ms per loop

I can squeeze out more speed using df.index.values:

%timeit pd.Series(df.index.values).replace(d)
10 loops, best of 3: 88 ms per loop

Timings will vary on your machine, so be sure to do your own tests before deciding what method to go with.
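
If you are not in IPython, something along these lines with the standard timeit module should give comparable numbers. This is only a sketch: the repeat factor of 1000 is an assumption chosen to keep the run short (the timings above used 100000), and the names big, candidates and results are made up for the example.

```python
import timeit

import pandas as pd

d = {"Republic of Korea": "South Korea",
     "United States of America20": "United States",
     "United Kingdom of Great Britain and Northern Ireland19": "United Kingdom",
     "China, Hong Kong Special Administrative Region": "Hong Kong"}

df = pd.DataFrame({'Country': ['12', '13', '14', '15']}, index=list(d))

# Smaller than the 100000 repeats used above, to keep the run short.
big = pd.concat([df] * 1000)

candidates = {
    "rename": lambda: big.rename(d),
    "Series.replace": lambda: pd.Series(big.index).replace(d),
    "Series.replace on values": lambda: pd.Series(big.index.values).replace(d),
}

number = 5  # calls per measurement
results = {}
for name, fn in candidates.items():
    # Take the best of three measurements, then average over the calls.
    best = min(timeit.repeat(fn, number=number, repeat=3))
    results[name] = best / number
    print(f"{name}: {results[name] * 1e3:.2f} ms per call")
```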

3 Comments

Hmmm, what is your pandas version? I tested the timings too and they are very similar. I use 0.21.0 on Win7 with Python 3.
@jezrael 0.21 on python3.4 (Ipython5), MacOS. My machine is a bit old so timings would vary.
Thanks for your help. Great to see all the alternatives.
