I am transforming a single dataframe using multiple CPU cores and want to insert the results into MySQL.
With the code below I observed only one active CPU core and no updates to MySQL; no error messages were produced.
The original dataframe pandas_df is never altered; every transformation of pandas_df is stored in result_df.
The code has been verified to work correctly when run serially. A rough sketch of the serial version is shown first, followed by the parallel attempt.
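For context, the serial version looks roughly like this (a sketch only; the connection string, dataframe, tuples, and slicing logic are placeholders for what my script actually builds):

import pandas as pd
from sqlalchemy import create_engine

MYSQL_STRING = "mysql+pymysql://user:password@host/db"  # placeholder; my real connection string differs
engine = create_engine(MYSQL_STRING)

pandas_df = pd.DataFrame()  # the real dataframe is built elsewhere
tuples = []                 # the list of tuples that drive each slice is built elsewhere

def function(pandas_df, tup, engine):
    # placeholder body: the real code slices and dices pandas_df according to tup
    result_df = pandas_df
    result_df.to_sql("TABLE_NAME", engine, if_exists='append')

for tup in tuples:
    function(pandas_df, tup, engine)

The parallel attempt that shows the problem: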
import multiprocessing as mp
from sqlalchemy import create_engine

engine = create_engine(MYSQL_STRING)  # MYSQL_STRING is my connection string, defined elsewhere

def function(pandas_df, tup, engine):
    # placeholder: the real code slices and dices pandas_df according to tup
    result_df = pandas_df
    result_df.to_sql("TABLE_NAME", engine, if_exists='append')

pool = mp.Pool(processes=4)
for tup in tuples:
    pool.apply_async(function, args=(pandas_df, tup, engine))
Most of the tutorials and guides I came across only pass strings inside args=().
There are, however, articles that demonstrate passing numpy arrays, for example: http://sebastianraschka.com/Articles/2014_multiprocessing_intro.html
I have also tried the above code with the map_async() method, both with and without a return statement inside the function, and the behaviour did not change. A sketch of that attempt is below.
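Roughly, the map_async() variant looked like this (a sketch only; binding the fixed arguments with functools.partial is one way to do it, and the exact call in my script may have differed slightly):

from functools import partial

# bind the dataframe and engine so map_async() only has to supply each tuple
worker = partial(function, pandas_df, engine=engine)
pool.map_async(worker, tuples)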
I am open to trying different Python modules. I need a solution that transforms a single dataframe in parallel and inserts the results into a database.