
I'm trying to parallelize a function I wrote for a sequential program. Below are the inputs and the output.

Input 1, a list of strings: ["foo bar los angles", "foo bar new york", ...]

Input 2, a list of strings serving as a dictionary: ["los angles", "new york", ...]

I want to remove every string in input 2 from the strings in input 1, so the output will look like:

["foo bar", "foo bar"].

I'm able to do it with a double for loop:

res = []
for s1 in input1:
    for s2 in input2:
        if s2 in s1:
            res.append(s1.replace(s2, ""))

But this runs slowly (more than 10 minutes on my MacBook Pro) with 2 million items in input1 (input2 has a couple of thousand).

I found a way to use Python's multiprocessing.dummy.Pool, using pool.map along with a global variable to parallelize it, roughly like the sketch below. But I'm concerned about the use of a global variable. Is it safe to do so? Is there a better way for Python threads to share a variable (maybe something like Apache Spark's mapPartitions)?
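A simplified sketch of what I have now (strip_substrings is a placeholder name; input2 is the shared global I'm worried about):

from multiprocessing.dummy import Pool  # thread pool with the multiprocessing API

input1 = ["foo bar los angles", "foo bar new york"]
input2 = ["los angles", "new york"]  # global, read by every worker thread

def strip_substrings(s1):  # placeholder name
    # remove every dictionary string found in this one item
    for s2 in input2:
        if s2 in s1:
            s1 = s1.replace(s2, "")
    return s1

pool = Pool(4)
res = pool.map(strip_substrings, input1)
print res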

I'm using Python 2.7 now, so I'd prefer answers that use Python 2.

4 Comments

  • As an intermediary step before you go ahead with the parallelization, you could probably make your code a lot faster just by switching the lists to sets (if that's okay for your problem). Commented Jun 14, 2016 at 22:45
  • Do you need to preserve ordering of the original list? Commented Jun 14, 2016 at 22:47
  • @Tyler Well, I don't think using a set will boost performance, because I still need to traverse the whole set. My requirement is to remove a substring from an item of input1 if that substring is in input2, so I don't think a set will improve performance. Commented Jun 14, 2016 at 23:13
  • @rrauenza No, I don't need to preserve the ordering. Commented Jun 14, 2016 at 23:14

1 Answer


It's generally recommended to avoid multithreading when you're after performance, because of the GIL. Luckily we have multiprocessing!

#!/usr/bin/python
import itertools
import multiprocessing

in1 = ["foo bar los angles", "foo bar new york",]
in2 = ["los angles", "new york",]

results = []

def sub(arg):
    # Each worker receives one (s1, s2) pair from the cartesian product.
    s1, s2 = arg
    if s2 in s1:
        return s1.replace(s2, "")
    # Falls through to return None when s2 is not a substring of s1.

pool = multiprocessing.Pool(4)
for result in pool.imap(sub, itertools.product(in1, in2)):
    # Skip the None results produced by non-matching pairs.
    if result is not None:
        results.append(result)

print results

It sounds like your 2 million item list is already in memory, so you'll want to use imap rather than map to keep from materializing the product as a list of billions of items (2 million × a couple of thousand is several billion pairs). I also use itertools.product to generate the Cartesian product of your inputs, which is what your nested loop was doing.

Your requirements were a little vague in terms of uniqueness: you were only adding to the results if you found a match.

Since we only append to results in the main process, there is no need to worry about the global results variable. If you were using multithreading, your map function could write directly to the results variable because of the GIL's protection... but your concurrency would also be suffering under that same GIL.

Note that you can tune imap by passing a larger chunksize. You can optimize further by relaxing the ordering requirement with imap_unordered; see the multiprocessing documentation for more information.
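For example, a sketch reusing sub(), in1, in2, and results from above (the chunksize of 1000 is just a starting value to experiment with, not a measured recommendation):

# Same sub(), in1, in2, and results as above. Larger chunks reduce
# inter-process communication overhead; imap_unordered lets results
# arrive in whatever order the workers finish.
for result in pool.imap_unordered(sub, itertools.product(in1, in2), chunksize=1000):
    if result is not None:
        results.append(result)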


2 Comments

Thank you rrauenza. Just a follow-up question: what if I want to pass another input to the function sub? Assume input1 is a list of lists, [["foo bar los angles"], ["foo bar new york"]], and I need to pass in an int variable 0 to index the inner list and get the string. Is there a way to share it without using a global variable?
I don't understand this question. Can you revise your posted question to be more specific about what you want to do?
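A minimal sketch of one way to handle that follow-up, assuming the list-of-lists input from the comment: pack the extra int into every argument tuple, so nothing has to be shared through a global.

#!/usr/bin/python
import itertools
import multiprocessing

in1 = [["foo bar los angles"], ["foo bar new york"]]  # list of lists, per the comment
in2 = ["los angles", "new york"]
IDX = 0  # the extra int: which element of each inner list holds the string

def sub(arg):
    # The extra value travels inside the argument tuple instead of a global.
    inner, s2, idx = arg
    s1 = inner[idx]
    if s2 in s1:
        return s1.replace(s2, "")

pool = multiprocessing.Pool(4)
args = ((inner, s2, IDX) for inner, s2 in itertools.product(in1, in2))
results = [r for r in pool.imap(sub, args) if r is not None]
print results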
