
I have two pieces of code; one is a pooled (multiprocessing) version of the other. However, the parallel version takes a long time even with a single worker process, whereas the serial version finishes in ~15 seconds. Can someone help accelerate the parallel version?

  1. Serial
    import numpy as np, time

    def mapTo(d):
        global tree
        for idx, item in enumerate(list(d), start=1):
            tree[str(item)].append(idx)

    data = np.random.randint(1, 4, 20000000)
    tree = {"1": [], "2": [], "3": []}
    s = time.perf_counter()
    mapTo(data)
    e = time.perf_counter()
    print("elapsed time:", e - s)

takes: ~15 sec

  2. Parallel
    from multiprocessing import Manager, Pool
    from functools import partial
    import numpy as np
    import time

    def mapTo(i_d, tree):
        idx, item = i_d
        l = tree[str(item)]
        l.append(idx)
        tree[str(item)] = l  # reassign so the Manager proxy sees the mutation

    if __name__ == "__main__":  # guard required so Pool workers don't re-run this block
        manager = Manager()
        data = np.random.randint(1, 4, 20000000)
        # sharedtree = manager.dict({"1": manager.list(), "2": manager.list(), "3": manager.list()})
        sharedtree = manager.dict({"1": [], "2": [], "3": []})
        s = time.perf_counter()
        with Pool(processes=1) as pool:
            pool.map(partial(mapTo, tree=sharedtree), list(enumerate(data, start=1)))
        e = time.perf_counter()
        print("elapsed time:", e - s)
  • `l = tree[str(item)]; l.append(idx); tree[str(item)] = l` - what does that do? Get the value for a key (which should be a list), append something to it, then assign it back as the value for the key? That last step seems unnecessary. Commented May 4, 2020 at 16:35
  • `d = {'1': list(np.where(data==1)[0]), '2': list(np.where(data==2)[0]), '3': list(np.where(data==3)[0])}`? Commented May 4, 2020 at 16:42
  • I'd break the assignment into three steps, since a one-line assignment into an embedded list (the dict's list) was not working in Python 3.5; check here. I can't use the suggested solution since Pool.map passes an iterable and the function has to handle one data point at a time. Commented May 5, 2020 at 3:13
  • The (non-concurrent) mapTo function takes 58 seconds on my computer, whereas using np.where (my previous comment) takes about 1.3 seconds. If that improvement is sufficient, you can avoid multiprocessing. Commented May 5, 2020 at 15:07
  • Actually, I've posted a simplified dict. In reality it is a four-level nested dict with around 200K lists. I'm inserting data into these lists, and my data size is around 20M. My plan is to break the data into batches of, say, 500K and run them concurrently. Commented May 5, 2020 at 15:26
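Following the `np.where` suggestion in the comments, the whole loop can be vectorized; the sketch below adds 1 to the 0-based positions that `np.where` returns, so the result matches the 1-based indices produced by the original `mapTo`:

```python
# Sketch of the vectorized alternative from the comments.
import numpy as np

data = np.random.randint(1, 4, 20000000)
# np.where gives 0-based positions; +1 reproduces the 1-based enumerate().
tree = {str(k): list(np.where(data == k)[0] + 1) for k in (1, 2, 3)}
```

For the flat three-key dict this replaces both the serial loop and the multiprocessing entirely; for the nested 200K-list case mentioned above, the same pattern can be applied per leaf key.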
