I have a large dataset (~1 GB, ~14 million rows). I want to clean up the country names; for example, replace CA with CANADA.

I tried:

mttt_pings.replace(['^CA$', '^US$', '^UNITED STATES$', '^MX$', '^TR$', '^GB$',
                    '^ENGLAND$', '^AU$', '^FR$', '^KOREA, REPUB OF$',
                    '^CONGO, DEM REP.$', '^SYRIA$', '^DOMINICAN REP.$',
                    '^RUSSIA$', '^TAIWAN$', '^UAE$', '^LIBYA$'], 
                   ['CANADA', 'UNITED STATES OF AMERICA', 
                   'UNITED STATES OF AMERICA', 'MEXICO', 'TURKEY',
                   'UNITED KINGDOM', 'UNITED KINGDOM', 'AUSTRALIA', 'FRANCE',
                   'KOREA, REPUBLIC OF', 'CONGO', 'SYRIA ARAB REPUBLIC',
                   'DOMINICAN REPUBLIC', 'RUSSIA FEDERATION',
                   'TAIWAN, PROVINCE OF CHINA', 'UNITED ARAB EMIRATES',
                   'LIBYAN ARAB JAMAHIRIYA'], 
regex = True, inplace = True)

This isn't even the full replacement list, just a subset. It ran for ~30 minutes before I killed the process.

I then tried writing individual replaces but that was still too slow.

  • Is there a better (more efficient) way to execute pandas replace on a large number of rows?
  • Would it be wiser to write a function of if statements and run it with df.apply(function)?
  • Or am I just missing something?

A sample set would look like:

import time
import pandas as pd
t0 = time.time()

df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd']})

df.replace({'^a$': 'America'}, regex = True, inplace = True)
df.replace({'^b$': 'Bahamas'}, regex = True, inplace = True)
df.replace({'^c$': 'Congo'}, regex = True, inplace = True)
df.replace({'^e$': 'Europe'}, regex = True, inplace = True)
df.replace({'^d$': 'Dominican Republic'}, regex = True, inplace = True)
tf = time.time()
total = tf - t0

Obviously this set is too small to fully replicate the time issues.

For this case, four runs yield: total = 0.00300002, total = 0.00299978, total = 0.00200009, and total = 0.00300002.

import time
import pandas as pd
t0 = time.time()

df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd']})

df.replace(['^a$', '^b$', '^c$', '^d$', '^e$'], 
           ['America', 'Bahamas', 'Congo', 'Dominican Republic', 'Europe'], 
regex = True, inplace = True)
tf = time.time()
total = tf - t0

Here we get total = 0.0019998, total = 0.00200009, total = 0.00200009, and total = 0.00200009.

So the list version of replace looks slightly faster, but it is still really slow on large datasets. Any ideas?
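Since the ten-row sample is too small to show the slowdown, a larger synthetic frame might reproduce it better. The following is only a sketch: the row count `n_rows`, the random seed, and the frame name `big` are assumptions chosen for illustration, not values from the real data.

```python
import time

import numpy as np
import pandas as pd

# Assumed size for illustration; raise it toward the real ~14 million rows.
n_rows = 200_000

# Randomly draw the same single-letter codes used in the small sample.
rng = np.random.default_rng(0)
big = pd.DataFrame({'ser_no': np.arange(n_rows),
                    'CTRY_NM': rng.choice(['a', 'b', 'c', 'd', 'e'], size=n_rows)})

# Time only the replace call, not the construction of the frame.
t0 = time.perf_counter()
out = big.replace(['^a$', '^b$', '^c$', '^d$', '^e$'],
                  ['America', 'Bahamas', 'Congo', 'Dominican Republic', 'Europe'],
                  regex=True)
total = time.perf_counter() - t0
print('regex replace on', n_rows, 'rows took', total, 'seconds')
```

Using time.perf_counter and starting the clock after the DataFrame is built keeps setup cost out of the measurement.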

  • why don't you want to do it in one shot? Commented Mar 30, 2016 at 13:57
  • @MaxU I never said I didn't want to. It just takes entirely too long, so I am trying to find a better way. Commented Mar 30, 2016 at 13:58
  • I'm confused. Are these all in the same column? Are they randomly scattered throughout the dataframe? I think you want to vectorize along the appropriate axes, and limit to only the columns you actually care about. That should help speed it up, no? Commented Mar 30, 2016 at 14:52
  • @szeitlin the replacement is all taking place in just the CTRY_NM column. How can I specify that the replacement only take place there? Commented Mar 30, 2016 at 15:08
  • @dustin, "Nested dictionaries, e.g., {'a': {'b': nan}}, are read as follows: look in column 'a' for the value 'b' and replace it with nan." from documentation Commented Mar 30, 2016 at 15:25
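As a minimal sketch of the nested-dictionary form quoted in the last comment, applied to the sample frame from the question (the name `result` is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd']})

# Outer key = column to search; inner dict = {old value: new value}.
# Without regex=True the inner keys are matched as exact values,
# and columns not named in the outer dict are left untouched.
result = df.replace({'CTRY_NM': {'a': 'America', 'b': 'Bahamas'}})
print(result)
```

Only CTRY_NM is searched here, so ser_no is never scanned for matches.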

1 Answer

For most methods that exist on DataFrame, there's a Series equivalent that works on a column. This doesn't seem to be in the documentation for 0.18 (yet!).

This worked for me:

df['CTRY_NM'] = df['CTRY_NM'].replace(to_replace=['^a$', '^b$'], value=['America', 'Bahamas'], regex=True)

This should be at least a little faster, since only the one column is scanned.
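Because every pattern in the question is anchored with ^ and $, each one is really an exact match, so a plain dictionary lookup may avoid regex entirely. The sketch below uses Series.map with a fallback to the original value; the helper name `normalize_countries` and the dictionary `country_map` are hypothetical, and I have not timed this on the 14-million-row data.

```python
import pandas as pd

# Hypothetical mapping; the real one would hold the full country list.
country_map = {'a': 'America', 'b': 'Bahamas', 'c': 'Congo',
               'd': 'Dominican Republic', 'e': 'Europe'}


def normalize_countries(series, mapping):
    """Replace exact matches via dict lookup; keep unmapped values as they are."""
    mapped = series.map(mapping)   # values missing from the mapping become NaN
    return mapped.fillna(series)   # restore the original value where no match


df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd']})
df['CTRY_NM'] = normalize_countries(df['CTRY_NM'], country_map)
print(df)
```

This only applies when whole-cell exact matches are wanted; any pattern that needs a real regular expression (partial matches, alternation) would still have to go through replace with regex=True.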
