I have a large dataset (~1 GB, ~14 million rows) and I want to clean up the country names, i.e. replace CA with CANADA, for example.
I tried:
mttt_pings.replace(['^CA$', '^US$', '^UNITED STATES$', '^MX$', '^TR$', '^GB$',
                    '^ENGLAND$', '^AU$', '^FR$', '^KOREA, REPUB OF$',
                    '^CONGO, DEM REP.$', '^SYRIA$', '^DOMINICAN REP.$',
                    '^RUSSIA$', '^TAIWAN$', '^UAE$', '^LIBYA$'],
                   ['CANADA', 'UNITED STATES OF AMERICA',
                    'UNITED STATES OF AMERICA', 'MEXICO', 'TURKEY',
                    'UNITED KINGDOM', 'UNITED KINGDOM', 'AUSTRALIA', 'FRANCE',
                    'KOREA, REPUBLIC OF', 'CONGO', 'SYRIA ARAB REPUBLIC',
                    'DOMINICAN REPUBLIC', 'RUSSIA FEDERATION',
                    'TAIWAN, PROVINCE OF CHINA', 'UNITED ARAB EMIRATES',
                    'LIBYAN ARAB JAMAHIRIYA'],
                   regex=True, inplace=True)
This isn't even the full replacement list, just a subset. It ran for ~30 minutes before I killed the process.
I then tried writing the replaces out individually, but that was still too slow.
- Is there a better (more efficient) way to run pandas replace on a large number of rows?
- Would a function of if statements be wiser, applied with df.apply(function)? (A sketch of what I mean follows this list.)
- Or am I just missing something?
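Here is a rough, untested sketch of that apply idea (the lookup table is a made-up subset of my real list):

import pandas as pd

# Hypothetical exact-match lookup, a subset of the real replacement list
replacements = {'CA': 'CANADA', 'US': 'UNITED STATES OF AMERICA',
                'MX': 'MEXICO', 'TR': 'TURKEY'}

def clean_country(name):
    # Plain dict lookup instead of regex; unknown names pass through unchanged
    return replacements.get(name, name)

df = pd.DataFrame({'CTRY_NM': ['CA', 'US', 'MX', 'FRANCE']})
df['CTRY_NM'] = df['CTRY_NM'].apply(clean_country)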
A sample set would look like:
import time
import pandas as pd

t0 = time.time()
df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd']})
df.replace({'^a$': 'America'}, regex=True, inplace=True)
df.replace({'^b$': 'Bahamas'}, regex=True, inplace=True)
df.replace({'^c$': 'Congo'}, regex=True, inplace=True)
df.replace({'^e$': 'Europe'}, regex=True, inplace=True)
df.replace({'^d$': 'Dominican Republic'}, regex=True, inplace=True)
tf = time.time()
total = tf - t0
Obviously this set is too small to fully replicate the time issues.
For this case, four runs yield total = 0.00300002, 0.00299978, 0.00200009, and 0.00300002 seconds.
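To get timings closer to the real scale, the 10-row sample could be tiled up, e.g. (repeat count picked to land near 14 million rows):

import pandas as pd

df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd']})
# Repeat the 10-row frame 1.4 million times -> ~14 million rows
big = pd.concat([df] * 1400000, ignore_index=True)

The equivalent test using the list form of replace: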
import time
import pandas as pd

t0 = time.time()
df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd']})
df.replace(['^a$', '^b$', '^c$', '^d$', '^e$'],
           ['America', 'Bahamas', 'Congo', 'Dominican Republic', 'Europe'],
           regex=True, inplace=True)
tf = time.time()
total = tf - t0
Here we get total = 0.0019998, 0.00200009, 0.00200009, and 0.00200009 seconds.
So the list version of replace is faster, but on large datasets it is still really slow. Any ideas?
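One more thought I haven't benchmarked yet: since every pattern is an exact anchored match, could a plain dict lookup via Series.map skip the regex engine entirely? Something like:

import pandas as pd

df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd']})
mapping = {'a': 'America', 'b': 'Bahamas', 'c': 'Congo',
           'd': 'Dominican Republic', 'e': 'Europe'}
# map() is a per-value hash lookup; fillna keeps values not in the dict
df['CTRY_NM'] = df['CTRY_NM'].map(mapping).fillna(df['CTRY_NM'])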