I have a large dataset (~1 GB, ~14 million rows) and I want to clean up the country names, i.e. replace CA with CANADA, for example.
I tried:
mttt_pings.replace(['^CA$', '^US$', '^UNITED STATES$', '^MX$', '^TR$', '^GB$',
                    '^ENGLAND$', '^AU$', '^FR$', '^KOREA, REPUB OF$',
                    '^CONGO, DEM REP.$', '^SYRIA$', '^DOMINICAN REP.$',
                    '^RUSSIA$', '^TAIWAN$', '^UAE$', '^LIBYA$'],
                   ['CANADA', 'UNITED STATES OF AMERICA',
                    'UNITED STATES OF AMERICA', 'MEXICO', 'TURKEY',
                    'UNITED KINGDOM', 'UNITED KINGDOM', 'AUSTRALIA', 'FRANCE',
                    'KOREA, REPUBLIC OF', 'CONGO', 'SYRIA ARAB REPUBLIC',
                    'DOMINICAN REPUBLIC', 'RUSSIA FEDERATION',
                    'TAIWAN, PROVINCE OF CHINA', 'UNITED ARAB EMIRATES',
                    'LIBYAN ARAB JAMAHIRIYA'],
                   regex=True, inplace=True)
This isn't even the full replacement list, just a subset. It ran for ~30 minutes before I killed the process.
I then tried writing the replaces out individually, but that was still too slow.
- Is there a better (more efficient) way to run pandas replace on a large number of rows?
- Would a function of if statements be wiser, applied with df.apply(function)? (A sketch of what I mean follows this list.)
- Or am I just missing something?
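Here is a rough, untested sketch of that apply idea (the lookup table is a made-up subset of my real list):

import pandas as pd

# Hypothetical exact-match lookup, a subset of the real replacement list
replacements = {'CA': 'CANADA', 'US': 'UNITED STATES OF AMERICA',
                'MX': 'MEXICO', 'TR': 'TURKEY'}

def clean_country(name):
    # Plain dict lookup instead of regex; unknown names pass through unchanged
    return replacements.get(name, name)

df = pd.DataFrame({'CTRY_NM': ['CA', 'US', 'MX', 'FRANCE']})
df['CTRY_NM'] = df['CTRY_NM'].apply(clean_country)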
A sample set would look like:
import time
import pandas as pd

t0 = time.time()
df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd']})
df.replace({'^a$': 'America'}, regex=True, inplace=True)
df.replace({'^b$': 'Bahamas'}, regex=True, inplace=True)
df.replace({'^c$': 'Congo'}, regex=True, inplace=True)
df.replace({'^e$': 'Europe'}, regex=True, inplace=True)
df.replace({'^d$': 'Dominican Republic'}, regex=True, inplace=True)
tf = time.time()
total = tf - t0
Obviously this set is too small to fully replicate the time issues.
For this case, four runs yield total = 0.00300002, 0.00299978, 0.00200009, and 0.00300002 seconds.
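To get timings closer to the real scale, the 10-row sample could be tiled up, e.g. (repeat count picked to land near 14 million rows):

import pandas as pd

df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd']})
# Repeat the 10-row frame 1.4 million times -> ~14 million rows
big = pd.concat([df] * 1400000, ignore_index=True)

The equivalent test using the list form of replace: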
import time
import pandas as pd

t0 = time.time()
df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd']})
df.replace(['^a$', '^b$', '^c$', '^d$', '^e$'],
           ['America', 'Bahamas', 'Congo', 'Dominican Republic', 'Europe'],
           regex=True, inplace=True)
tf = time.time()
total = tf - t0
Here we get total = 0.0019998, 0.00200009, 0.00200009, and 0.00200009 seconds.
So the list version of replace is faster, but on large datasets it is still really slow. Any ideas?
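One more thought I haven't benchmarked yet: since every pattern is an exact anchored match, could a plain dict lookup via Series.map skip the regex engine entirely? Something like:

import pandas as pd

df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd']})
mapping = {'a': 'America', 'b': 'Bahamas', 'c': 'Congo',
           'd': 'Dominican Republic', 'e': 'Europe'}
# map() is a per-value hash lookup; fillna keeps values not in the dict
df['CTRY_NM'] = df['CTRY_NM'].map(mapping).fillna(df['CTRY_NM'])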