I have a DataFrame of 200k rows. I want to split it into parts and call my function S_Function on each partition.
def S_Function(df):
    # my code here
    return new_df
Main program
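import pandas as pd
from threading import Thread

N_Threads = 10
Threads = []
# Thread.join() always returns None, so each thread stores its result in its own slot.
Out = [None] * (N_Threads + 1)

def Worker(i, part):
    # Runs S_Function on one part and keeps the returned DataFrame.
    Out[i] = S_Function(part)

size = df.shape[0] // N_Threads
# One extra chunk picks up the rows left over by the integer division.
for i in range(N_Threads + 1):
    begin = i * size
    end = min(df.shape[0], (i + 1) * size)
    # args must be a tuple.
    Threads.append(Thread(target=Worker, args=(i, df[begin:end])))
I run the threads and join them:
for t in Threads:
    t.start()
for t in Threads:
    t.join()
output = pd.concat(Out)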
The code works, but using threading.Thread did not noticeably decrease the execution time.
Sequential code: 16 minutes
Parallel code: 15 minutes
Can someone explain why this is not working well and what I should improve?
apply(). This pandas function has multiple ways to be parallelized. One of them is pandarallel. You can find more information in this thread.
Thread from threading module?
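If it is threading.Thread, the small gain is expected whenever S_Function is CPU-bound Python code: in standard CPython builds the global interpreter lock lets only one thread execute Python bytecode at a time, so the threads mostly take turns. Below is a minimal sketch of a process-based alternative. It assumes S_Function is CPU-bound, defined at module top level, and picklable. The names chunk_bounds, run_on_chunks, and executor_cls are hypothetical helpers made up for illustration, not part of pandas or pandarallel.

```python
from concurrent.futures import ProcessPoolExecutor


def chunk_bounds(n_rows, n_parts):
    """Return (begin, end) pairs covering range(n_rows) in n_parts contiguous chunks."""
    base, extra = divmod(n_rows, n_parts)
    bounds = []
    begin = 0
    for i in range(n_parts):
        # The first `extra` chunks take one additional row each.
        end = begin + base + (1 if i < extra else 0)
        bounds.append((begin, end))
        begin = end
    return bounds


def run_on_chunks(data, func, n_workers, executor_cls=ProcessPoolExecutor):
    """Apply func to each chunk of data and return the results in chunk order.

    data must support len() and slicing (a DataFrame does, and so does a list).
    func must be defined at module top level so it can be pickled and sent
    to the worker processes. executor_cls is a hypothetical parameter that
    allows swapping in another executor class for comparison.
    """
    chunks = [data[b:e] for b, e in chunk_bounds(len(data), n_workers)]
    with executor_cls(max_workers=n_workers) as pool:
        # map() preserves input order, so results line up with the chunks.
        return list(pool.map(func, chunks))
```

With the names from the question, the call would look like output = pd.concat(run_on_chunks(df, S_Function, 10)), placed under an if __name__ == "__main__": guard so that worker processes can import the module safely. Each chunk is pickled and copied to a worker, so the speedup depends on how heavy S_Function is relative to that transfer cost.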