
I'm new to parallel processing in Python. I have a large dataframe with names and the list of countries each person has lived in. A sample dataframe is this:

[image: sample dataframe with a Name column and a comma-separated Country column]

I have a chunk of code that takes in this dataframe and splits the countries to separate columns. The code is this:

import pandas as pd

def split_country(data):
    # collect one record per (name, country) pair
    d_list = []
    for index, row in data.iterrows():
        for value in str(row['Country']).split(','):
            d_list.append({'Name': row['Name'],
                           'value': value})
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    data = pd.concat([data, pd.DataFrame(d_list)], ignore_index=True)
    # count each country per name and pivot the counts into columns
    data = data.groupby('Name')['value'].value_counts()
    data = data.unstack(level=-1).fillna(0)
    return data

The final output is something like this:

[image: resulting dataframe indexed by Name, with one count column per country]

I'm trying to parallelize the above process by passing my dataframe (df) using the following:

import multiprocessing as mp

result = []
pool = mp.Pool(mp.cpu_count())
result.append(pool.map(split_country, [row for row in df]))

But the processing does not stop even with a toy dataset like the one above. I'm completely new to this, so I would appreciate any help.
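For reference, a common pattern for parallelizing this kind of per-row transformation is to split the frame into chunks and `pool.map` over the chunks rather than over rows. The helper names below (`explode_countries`, `process_parallel`) are illustrative, not from the question:

```python
import multiprocessing as mp

import pandas as pd

def explode_countries(chunk):
    # turn the comma-separated Country strings into one row per (Name, Country)
    return chunk.assign(Country=chunk['Country'].str.split(',')).explode('Country')

def process_parallel(df, n_workers=None):
    # split the frame into contiguous chunks, one per worker
    n_workers = n_workers or mp.cpu_count()
    size = -(-len(df) // n_workers)  # ceiling division
    chunks = [df.iloc[i:i + size] for i in range(0, len(df), size)]
    with mp.Pool(n_workers) as pool:
        parts = pool.map(explode_countries, chunks)
    long_df = pd.concat(parts)
    # aggregate once at the end, so names split across chunks are still counted correctly
    return pd.crosstab(long_df['Name'], long_df['Country'])

if __name__ == '__main__':
    df = pd.DataFrame({'Name': ['John', 'Jack', 'James'],
                       'Country': ['USA,UK', 'China,UK', 'Canada,USA']})
    print(process_parallel(df, n_workers=2))
```

Note that `[row for row in df]` in the snippet above iterates over column names, not rows, which is one reason the original attempt misbehaves; mapping over chunks sidesteps that.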

  • In any case, please provide a complete example, i.e. a runnable program that has the same df as yours. Commented Aug 8, 2020 at 19:13

1 Answer

  • multiprocessing is probably not required here. Using pandas vectorized methods will be sufficient to quickly produce the desired result.
    • For a test DataFrame with 1M rows, the following code took 1.54 seconds.
  • First, use pandas.DataFrame.explode on the column of lists
    • If the column contains strings, first use ast.literal_eval to convert them to lists
      • df.countries = df.countries.apply(ast.literal_eval)
      • If the data is read from a CSV file, use df = pd.read_csv('test.csv', converters={'countries': literal_eval})
  • For this question, it's better to use pandas.get_dummies to get a count of each country per name, then pandas.DataFrame.groupby on 'name', and aggregate with .sum
import pandas as pd
from ast import literal_eval

# sample data
data = {'name': ['John', 'Jack', 'James'], 'countries': [['USA', 'UK'], ['China', 'UK'], ['Canada', 'USA']]}

# create the dataframe
df = pd.DataFrame(data)

# if the countries column is strings, evaluate to lists; otherwise skip this line
df.countries = df.countries.apply(literal_eval)

# explode the lists
df = df.explode('countries')

# use get_dummies and groupby name and sum
df_counts = pd.get_dummies(df, columns=['countries'], prefix_sep='', prefix='').groupby('name', as_index=False).sum()

# display(df_counts)
    name  Canada  China  UK  USA
0   Jack       0      1   1    0
1  James       1      0   0    1
2   John       0      0   1    1
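The read_csv converters tip from the bullets above can be checked with an in-memory CSV, using io.StringIO to stand in for the hypothetical test.csv:

```python
import io
from ast import literal_eval

import pandas as pd

# CSV where the countries column was saved as stringified Python lists
csv_text = '''name,countries
John,"['USA', 'UK']"
Jack,"['China', 'UK']"
'''

# converters runs literal_eval on each cell while parsing, so the
# column comes back as real list objects, ready for .explode
df = pd.read_csv(io.StringIO(csv_text), converters={'countries': literal_eval})
```

With the converter applied, the separate `df.countries.apply(literal_eval)` step is no longer needed.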
