
I'm trying to parallelize a function I wrote for a sequential program. Below are the inputs and the output.

Input 1, a list of strings: ["foo bar los angles", "foo bar new york", ...]

Input 2, a list of strings serving as a dictionary: ["los angles", "new york", ...]

I want to remove every string in input 2 from the strings in input 1, so the output will look like:

["foo bar", "foo bar"].

I'm able to do it with a double for loop:

res = []
for s1 in input1:
    for s2 in input2:
        if s2 in s1:
            res.append(s1.replace(s2, ""))

But this runs slowly (more than 10 minutes on my MacBook Pro) with 2 million items in input1 (input2 has a couple of thousand).

I found a way to use Python's multiprocessing.dummy.Pool, using pool.map along with a global variable to parallelize it, roughly like the sketch below. But I'm concerned about the use of a global variable. Is it safe to do so? Is there a better way for Python threads to share a variable (maybe something like Apache Spark's mapPartitions)?
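A simplified sketch of what I have now (strip_substrings is a placeholder name; input2 is the shared global I'm worried about):

from multiprocessing.dummy import Pool  # thread pool with the multiprocessing API

input1 = ["foo bar los angles", "foo bar new york"]
input2 = ["los angles", "new york"]  # global, read by every worker thread

def strip_substrings(s1):  # placeholder name
    # remove every dictionary string found in this one item
    for s2 in input2:
        if s2 in s1:
            s1 = s1.replace(s2, "")
    return s1

pool = Pool(4)
res = pool.map(strip_substrings, input1)
print res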

I'm using Python 2.7 now, so I'd prefer answers that use Python 2.

4 Comments

  • As an intermediary step before you go ahead with the parallelization, you could probably make your code a lot faster just by switching the lists to sets (if that's okay for your problem). Commented Jun 14, 2016 at 22:45
  • Do you need to preserve ordering of the original list? Commented Jun 14, 2016 at 22:47
  • @Tyler Well, I don't think using a set will boost performance, because I still need to traverse the whole set. My requirement is to remove a substring from an item of input1 if that substring is in input2, so I don't think a set will improve performance. Commented Jun 14, 2016 at 23:13
  • @rrauenza No, I don't need to preserve the ordering. Commented Jun 14, 2016 at 23:14

1 Answer


It's generally recommended to avoid multithreading when you're after performance, because of the GIL. Luckily we have multiprocessing!

#!/usr/bin/python
import itertools
import multiprocessing

in1 = ["foo bar los angles", "foo bar new york",]
in2 = ["los angles", "new york",]

results = []

def sub(arg):
    # Each worker receives one (s1, s2) pair from the cartesian product.
    s1, s2 = arg
    if s2 in s1:
        return s1.replace(s2, "")
    # Falls through to return None when s2 is not a substring of s1.

pool = multiprocessing.Pool(4)
for result in pool.imap(sub, itertools.product(in1, in2)):
    # Skip the None results produced by non-matching pairs.
    if result is not None:
        results.append(result)

print results

It sounds like your 2 million item list is already in memory, so you'll want to use imap rather than map to keep from materializing the product as a list of billions of items (2 million × a couple of thousand is several billion pairs). I also use itertools.product to generate the Cartesian product of your inputs, which is what your nested loop was doing.

Your requirements were a little vague in terms of uniqueness: you were only adding to the results if you found a match.

Since we only append to results in the main process, there is no need to worry about the global results variable. If you were using multithreading, your map function could write directly to the results variable because of the GIL's protection... but your concurrency would also be suffering under that same GIL.

Note that you can tune imap by passing a larger chunksize. You can optimize further by relaxing the ordering requirement with imap_unordered; see the multiprocessing documentation for more information.
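For example, a sketch reusing sub(), in1, in2, and results from above (the chunksize of 1000 is just a starting value to experiment with, not a measured recommendation):

# Same sub(), in1, in2, and results as above. Larger chunks reduce
# inter-process communication overhead; imap_unordered lets results
# arrive in whatever order the workers finish.
for result in pool.imap_unordered(sub, itertools.product(in1, in2), chunksize=1000):
    if result is not None:
        results.append(result)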


2 Comments

Thank you rrauenza. Just a follow-up question: what if I want to pass another input to the function sub? Assume input1 is a list of lists, [["foo bar los angles"], ["foo bar new york"]], and I need to pass in an int variable 0 to index the inner list and get the string. Is there a way to share it without using a global variable?
I don't understand this question. Can you revise your posted question to be more specific about what you want to do?
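A minimal sketch of one way to handle that follow-up, assuming the list-of-lists input from the comment: pack the extra int into every argument tuple, so nothing has to be shared through a global.

#!/usr/bin/python
import itertools
import multiprocessing

in1 = [["foo bar los angles"], ["foo bar new york"]]  # list of lists, per the comment
in2 = ["los angles", "new york"]
IDX = 0  # the extra int: which element of each inner list holds the string

def sub(arg):
    # The extra value travels inside the argument tuple instead of a global.
    inner, s2, idx = arg
    s1 = inner[idx]
    if s2 in s1:
        return s1.replace(s2, "")

pool = multiprocessing.Pool(4)
args = ((inner, s2, IDX) for inner, s2 in itertools.product(in1, in2))
results = [r for r in pool.imap(sub, args) if r is not None]
print results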
