
I'm new to parallel processing in Python. I have a large dataframe with names and the list of countries each person has lived in. A sample dataframe is this:

[image: sample dataframe with a Name column and a comma-separated Country column]

I have a chunk of code that takes in this dataframe and splits the countries to separate columns. The code is this:

import pandas as pd

def split_country(data):
    # collect one record per (name, country) pair
    d_list = []
    for index, row in data.iterrows():
        for value in str(row['Country']).split(','):
            d_list.append({'Name': row['Name'],
                           'value': value})
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    data = pd.concat([data, pd.DataFrame(d_list)], ignore_index=True)
    # count each country per name and pivot the counts into columns
    data = data.groupby('Name')['value'].value_counts()
    data = data.unstack(level=-1).fillna(0)
    return data

The final output is something like this:

[image: resulting dataframe indexed by Name, with one count column per country]

I'm trying to parallelize the above process by passing my dataframe (df) using the following:

import multiprocessing as mp

result = []
pool = mp.Pool(mp.cpu_count())
result.append(pool.map(split_country, [row for row in df]))

But the processing does not stop even with a toy dataset like the one above. I'm completely new to this, so I would appreciate any help.
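For reference, a common pattern for parallelizing this kind of per-row transformation is to split the frame into chunks and `pool.map` over the chunks rather than over rows. The helper names below (`explode_countries`, `process_parallel`) are illustrative, not from the question:

```python
import multiprocessing as mp

import pandas as pd

def explode_countries(chunk):
    # turn the comma-separated Country strings into one row per (Name, Country)
    return chunk.assign(Country=chunk['Country'].str.split(',')).explode('Country')

def process_parallel(df, n_workers=None):
    # split the frame into contiguous chunks, one per worker
    n_workers = n_workers or mp.cpu_count()
    size = -(-len(df) // n_workers)  # ceiling division
    chunks = [df.iloc[i:i + size] for i in range(0, len(df), size)]
    with mp.Pool(n_workers) as pool:
        parts = pool.map(explode_countries, chunks)
    long_df = pd.concat(parts)
    # aggregate once at the end, so names split across chunks are still counted correctly
    return pd.crosstab(long_df['Name'], long_df['Country'])

if __name__ == '__main__':
    df = pd.DataFrame({'Name': ['John', 'Jack', 'James'],
                       'Country': ['USA,UK', 'China,UK', 'Canada,USA']})
    print(process_parallel(df, n_workers=2))
```

Note that `[row for row in df]` in the snippet above iterates over column names, not rows, which is one reason the original attempt misbehaves; mapping over chunks sidesteps that.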

  • In any case, please provide a complete example, i.e. a runnable program that has the same df as yours. Commented Aug 8, 2020 at 19:13

1 Answer

  • multiprocessing is probably not required here. Using pandas vectorized methods will be sufficient to quickly produce the desired result.
    • For a test DataFrame with 1M rows, the following code took 1.54 seconds.
  • First, use pandas.DataFrame.explode on the column of lists
    • If the column contains strings, first use ast.literal_eval to convert them to lists
      • df.countries = df.countries.apply(ast.literal_eval)
      • If the data is read from a CSV file, use df = pd.read_csv('test.csv', converters={'countries': literal_eval})
  • For this question, it's better to use pandas.get_dummies to get a count of each country per name, then pandas.DataFrame.groupby on 'name', and aggregate with .sum
import pandas as pd
from ast import literal_eval

# sample data
data = {'name': ['John', 'Jack', 'James'], 'countries': [['USA', 'UK'], ['China', 'UK'], ['Canada', 'USA']]}

# create the dataframe
df = pd.DataFrame(data)

# if the countries column is strings, evaluate to lists; otherwise skip this line
df.countries = df.countries.apply(literal_eval)

# explode the lists
df = df.explode('countries')

# use get_dummies and groupby name and sum
df_counts = pd.get_dummies(df, columns=['countries'], prefix_sep='', prefix='').groupby('name', as_index=False).sum()

# display(df_counts)
    name  Canada  China  UK  USA
0   Jack       0      1   1    0
1  James       1      0   0    1
2   John       0      0   1    1
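The read_csv converters tip from the bullets above can be checked with an in-memory CSV, using io.StringIO to stand in for the hypothetical test.csv:

```python
import io
from ast import literal_eval

import pandas as pd

# CSV where the countries column was saved as stringified Python lists
csv_text = '''name,countries
John,"['USA', 'UK']"
Jack,"['China', 'UK']"
'''

# converters runs literal_eval on each cell while parsing, so the
# column comes back as real list objects, ready for .explode
df = pd.read_csv(io.StringIO(csv_text), converters={'countries': literal_eval})
```

With the converter applied, the separate `df.countries.apply(literal_eval)` step is no longer needed.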
