
I have a list of around 500,000 sentences and a list of around 13,000,000 concepts. For each sentence I want to extract the concepts that occur in it, in the order they appear in the sentence, and write them to an output list.

For example, my Python program looks as follows:

import re

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', 
             'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
             'data mining is the analysis step of the knowledge discovery in databases process or kdd']

concepts = ['data mining', 'database systems', 'databases process', 
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']

output = []
counting = 0

re_concepts = [re.escape(t) for t in concepts]

find_all_concepts = re.compile('|'.join(re_concepts), flags=re.DOTALL).findall

for sentence in sentences:
    output.append(find_all_concepts(sentence))

print(output)

The output is: [['data mining', 'process', 'patterns', 'methods', 'machine learning', 'database systems'], ['data mining', 'interdisciplinary subfield', 'information', 'information'], ['data mining', 'knowledge discovery', 'databases process']]

However, the order of the output is not important to me, i.e. my output could also look as follows (in other words, the lists inside the output can be shuffled):

[['data mining', 'interdisciplinary subfield', 'information', 'information'], ['data mining', 'knowledge discovery', 'databases process'], ['data mining', 'process', 'patterns', 'methods', 'machine learning', 'database systems']]

[['data mining', 'knowledge discovery', 'databases process'], ['data mining', 'interdisciplinary subfield', 'information', 'information'], ['data mining', 'process', 'patterns', 'methods', 'machine learning', 'database systems']]

However, due to the length of my lists of sentences and concepts, this program is still quite slow.

Is it possible to further improve the performance (in terms of time) using multithreading in Python?

Comments

  • Multithreading in Python is great when your threads are mainly waiting - for network resources, web requests, DB queries, disk I/O, etc. In this case it looks like you are CPU bound, in which case multithreading won't help at all.
  • There are lots of other things I'd try before multithreading: PyPy, Cython, a smarter algorithm, etc.
  • How long does it take to create find_all_concepts?
  • What is the maximum concept word length - 2, 3, 4?
  • Python Multiprocessing vs Threading - pros/cons SO answer. There are other good answers on that page.

3 Answers


Whether multi-threading will yield an actual performance increase does not just depend on the Python implementation and the amount of data; it also depends on the hardware executing the program. In some cases, where the hardware offers no advantage, multi-threading may end up slowing things down due to increased overhead.

However, assuming you're running on a modern standard PC or better, you may see some improvement with multi-threading. The problem then is to set up a number of workers, pass the work to them and collect the results.

Staying close to your example structure, implementation and naming:

import re
import queue
import threading

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
             'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
             'data mining is the analysis step of the knowledge discovery in databases process or kdd']

concepts = ['data mining', 'database systems', 'databases process',
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']

re_concepts = [re.escape(t) for t in concepts]

find_all_concepts = re.compile('|'.join(re_concepts), flags=re.DOTALL).findall


def do_find_all_concepts(q_in, l_out):
    while True:
        sentence = q_in.get()
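        # list.append is atomic under CPython's GIL, so the worker threads can safely share l_out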
        l_out.append(find_all_concepts(sentence))
        q_in.task_done()


# Queue with default maxsize of 0, infinite queue size
sentences_q = queue.Queue()
output = []

# any reasonable number of workers
num_threads = 2
for i in range(num_threads):
    worker = threading.Thread(target=do_find_all_concepts, args=(sentences_q, output))
    # once there's nothing but daemon threads left, Python exits the program
    worker.daemon = True
    worker.start()

# put all the input on the queue
for s in sentences:
    sentences_q.put(s)

# wait for the entire queue to be processed
sentences_q.join()
print(output)

User @wwii noted that multiple threads don't really help for CPU-bound problems. Instead of using multiple threads accessing the same output variable, you could also use multiple processes accessing a shared output queue, like this:

import re
import queue
import multiprocessing

sentences = [
    'data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
    'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
    'data mining is the analysis step of the knowledge discovery in databases process or kdd']

concepts = ['data mining', 'database systems', 'databases process',
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']

re_concepts = [re.escape(t) for t in concepts]

find_all_concepts = re.compile('|'.join(re_concepts), flags=re.DOTALL).findall


def do_find_all_concepts(q_in, q_out):
    try:
        while True:
            sentence = q_in.get(False)
            q_out.put(find_all_concepts(sentence))
    except queue.Empty:
        pass


if __name__ == '__main__':
    # default maxsize of 0, infinite queue size
    sentences_q = multiprocessing.Queue()
    output_q = multiprocessing.Queue()

    # put all the input on the queue before starting the workers,
    # so they don't find an empty queue and exit immediately
    for s in sentences:
        sentences_q.put(s)

    # any reasonable number of workers
    num_processes = 2
    pool = multiprocessing.Pool(num_processes, do_find_all_concepts, (sentences_q, output_q))

    # wait for the workers to drain the queue
    pool.close()
    pool.join()
    while not output_q.empty():
        print(output_q.get())
    while not output_q.empty():
        print(output_q.get())

This has more overhead still, but it uses CPU resources available on other cores as well.


4 Comments

  • Thanks a lot. Can we increase the number of threads to, say, 10, assuming that I have more capable hardware?
  • Yes, just change the num_threads = 2 line to whatever you think is optimal. You could test a bit with a smaller set (but larger than given in the example) and use some timing functions to decide what number of threads is optimal for your system; a minimal timing sketch follows these comments. Bigger is not always better, as your hardware and OS can only do so much in parallel.
  • It depends @wwii, but since most modern CPUs have multiple cores and feature hyper-threading, having multiple threads could allow the program to make more efficient use of the available compute resources. Of course, resources like memory I/O, network I/O and disk I/O may be the actual bottlenecks, depending on the problem you're trying to solve. But since you ask about 'cpu bound' tasks, yes, as long as there are unused CPU resources that would be accessed by multi-threading.
  • I'm sorry @wwii, I misinterpreted - disregard that. If they are truly CPU bound, i.e. bottlenecked by the CPU, already fully utilising the capacity of the CPU they are running on - then probably not. You'd have to look for solutions using multiprocessing instead of multithreading.
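Following up on the suggestion above about timing different thread counts, here is a minimal sketch of such a test (an assumption, not part of the answer; it reuses queue, threading, sentences and do_find_all_concepts exactly as defined in the answer above):

import time

def run_with_threads(num_threads, sample_sentences):
    """Return the seconds taken to process sample_sentences with num_threads workers."""
    q_in = queue.Queue()
    results = []
    for _ in range(num_threads):
        t = threading.Thread(target=do_find_all_concepts, args=(q_in, results), daemon=True)
        t.start()
    start = time.perf_counter()
    for s in sample_sentences:
        q_in.put(s)
    q_in.join()  # wait until every queued sentence has been processed
    return time.perf_counter() - start

# leftover daemon threads stay blocked on their own (now unused) queues and exit with the program
for n in (1, 2, 4, 8):
    print(n, 'threads:', round(run_with_threads(n, sentences * 1000), 3), 'seconds')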

This answer will address improving performance without using concurrency.


The way you structured your search, you are looking for 13 million unique things in each sentence. You said it takes 3-5 minutes per sentence and that the word lengths in concepts range from one to ten.

I think you can improve the search time by making a set of concepts (either initially when constructed or from your list) then splitting each sentence into strings of one to ten (consecutive) words and testing for membership in the set.

Example of a sentence split into 4 word strings:

'data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems'
# becomes
[('data', 'mining', 'is', 'the'),
 ('mining', 'is', 'the', 'process'),
 ('is', 'the', 'process', 'of'),
 ('the', 'process', 'of', 'discovering'),
 ('process', 'of', 'discovering', 'patterns'),
 ('of', 'discovering', 'patterns', 'in'),
 ('discovering', 'patterns', 'in', 'large'),
 ('patterns', 'in', 'large', 'data'),
 ('in', 'large', 'data', 'sets'),
 ('large', 'data', 'sets', 'involving'),
 ('data', 'sets', 'involving', 'methods'),
 ('sets', 'involving', 'methods', 'at'),
 ('involving', 'methods', 'at', 'the'),
 ('methods', 'at', 'the', 'intersection'),
 ('at', 'the', 'intersection', 'of'),
 ('the', 'intersection', 'of', 'machine'),
 ('intersection', 'of', 'machine', 'learning'),
 ('of', 'machine', 'learning', 'statistics'),
 ('machine', 'learning', 'statistics', 'and'),
 ('learning', 'statistics', 'and', 'database'),
 ('statistics', 'and', 'database', 'systems')]

Process:

concepts = set(concepts)
sentence = sentence.split()
found = []

# one-word concepts
for meme in sentence:
    if meme in concepts:
        found.append(meme)

# two-word concepts
for meme in zip(sentence, sentence[1:]):
    if ' '.join(meme) in concepts:
        found.append(' '.join(meme))

# three-word concepts
for meme in zip(sentence, sentence[1:], sentence[2:]):
    if ' '.join(meme) in concepts:
        found.append(' '.join(meme))

Adapting an itertools recipe (pairwise) you can automate that process of making n-word strings from a sentence:

from itertools import tee
def nwise(iterable, n=2):
    "s -> (s0,s1), (s1,s2), (s2, s3), ... for n=2"
    iterables = tee(iterable, n)
    # advance each iterable to the appropriate starting point
    for i, thing in enumerate(iterables[1:],1):
        for _ in range(i):
            next(thing, None)
    return zip(*iterables)

Testing each sentence looks like this

sentence = sentence.strip().split()
for n in [1,2,3,4,5,6,7,8,9,10]:
    for meme in nwise(sentence,n):
        if ' '.join(meme) in concepts:
            #keep meme

I made a set of 13e6 random strings with 20 characters each to approximate concepts.

import random, string
data = set(''.join(random.choice(string.printable) for _ in range(20)) for _ in range(13000000))

Testing a four- or forty-character string for membership in data consistently takes about 60 nanoseconds. A one-hundred-word sentence yields 955 one- to ten-word strings, so searching that sentence should take about 60 microseconds.
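A rough way to reproduce that kind of measurement with timeit (a sketch using a smaller stand-in set, not the answerer's exact benchmark):

import random
import string
import timeit

# smaller stand-in for the 13e6-entry set, to keep memory modest
stand_in = set(''.join(random.choice(string.ascii_lowercase) for _ in range(20))
               for _ in range(1_000_000))

probe = 'twenty-char probe!!!'  # a 20-character string that is not in the set
runs = 1_000_000
seconds = timeit.timeit(lambda: probe in stand_in, number=runs)
print(f'{seconds / runs * 1e9:.0f} ns per membership test')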

The first sentence from your example 'data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems' has 195 possible concepts (one- to ten-word strings). Timing for the following two functions is about the same: roughly 140 microseconds for f and 150 microseconds for g:

def f(sentence, data=data, nwise=nwise):
    '''iterate over memes in sentence and see if they are in data'''
    sentence = sentence.strip().split()
    found = []
    for n in [1,2,3,4,5,6,7,8,9,10]:
        for meme in nwise(sentence,n):
            meme = ' '.join(meme)
            if meme in data:
                found.append(meme)
    return found

def g(sentence, data=data, nwise=nwise):
    '''make a set of the memes in sentence then find its intersection with data'''
    sentence = sentence.strip().split()
    test_strings = set(' '.join(meme) for n in range(1,11) for meme in nwise(sentence,n))
    found = test_strings.intersection(data)
    return found

So these are just approximations since I'm not using your actual data, but it should speed things up quite a bit.

After testing with your example data I found that g won't work if a concept appears twice in a sentence.


So here it is all together, with the concepts listed in the order they are found in each sentence. The new version of f will take longer, but the added time should be relatively small. If possible, would you post a comment letting me know how much longer it is than the original? (I'm curious.)

from itertools import tee

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', 
             'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
             'data mining is the analysis step of the knowledge discovery in databases process or kdd']

concepts = ['data mining', 'database systems', 'databases process', 
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']

concepts = set(concepts)

def nwise(iterable, n=2):
    "s -> (s0,s1), (s1,s2), (s2, s3), ... for n=2"
    iterables = tee(iterable, n)
    # advance each iterable to the appropriate starting point
    for i, thing in enumerate(iterables[1:],1):
        for _ in range(i):
            next(thing, None)
    return zip(*iterables)

def f(sentence, concepts=concepts, nwise=nwise):
    '''iterate over memes in sentence and see if they are in concepts'''
    indices = set()
    #print(sentence)
    words = sentence.strip().split()
    for n in [1,2,3,4,5,6,7,8,9,10]:
        for meme in nwise(words,n):
            meme = ' '.join(meme)
            if meme in concepts:
                start = sentence.find(meme)
                end = len(meme)+start
                while (start,end) in indices:
                    #print(f'{meme} already found at character:{start} - looking for another one...') 
                    start = sentence.find(meme, end)
                    end = len(meme)+start
                indices.add((start, end))
    return [sentence[start:end] for (start,end) in sorted(indices)]


###########
results = []
for sentence in sentences:
    results.append(f(sentence))
    #print(f'{sentence}\n\t{results[-1]})')


In [20]: results
Out[20]: 
[['data mining', 'process', 'patterns', 'methods', 'machine learning', 'database systems'],
 ['data mining', 'interdisciplinary subfield', 'information', 'information'],
 ['data mining', 'knowledge discovery', 'databases process', 'process']]

Comments

  • nwise is the function I showed; it uses itertools.tee. data is a set I made for testing; it is supposed to approximate your concepts. See the edit.
  • Sorry, I didn't notice that part of the spec - see the edit.
  • Good one: see the edit - I refactored f to keep track of indices and look for the concept again if it had already been found. I don't think it will increase the time appreciably.
  • So about 0.122 seconds per sentence? Seems a bit high. If you have enough RAM you could try some multiprocessing tests with concurrent.futures using this function.
  • No way I can check without your data. Do you understand how the function works? You can test it with that sentence; it only has 117 checks. I put some prints in the function that you can use to troubleshoot; you may need to add more. Sequential repeated words that are concepts will cause duplication of effort, which is wasted.

Here are two solutions using concurrent.futures.ProcessPoolExecutor, which will distribute the tasks to different processes. Your task appears to be CPU bound, not I/O bound, so threads probably won't help.

import re
import concurrent.futures

# using the lists in your example

re_concepts = [re.escape(t) for t in concepts]
all_concepts = re.compile('|'.join(re_concepts), flags=re.DOTALL)

def f(sequence, regex=all_concepts):
    result = regex.findall(sequence)
    return result

if __name__ == '__main__':

    out1 = []
    with concurrent.futures.ProcessPoolExecutor() as executor:
        futures = [executor.submit(f, s) for s in sentences]
        for future in concurrent.futures.as_completed(futures):
            try:
                result = future.result()
            except Exception as e:
                print(e)
            else:
                #print(result)
                out1.append(result)   

    out2 = []
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for result in executor.map(f, sentences):
            #print(result)
            out2.append(result)

Executor.map() has a chunksize parameter: the docs say that for long iterables, submitting chunks of more than one item can be beneficial. I tested this with a function that would just return what it was sent, but regardless of the chunksize I specified the test function only received single items. That is expected: chunksize only controls how many items are pickled and shipped to a worker per task; the mapped function is still called once per item, so it does not need to be refactored.

def h(sequence):
    return sequence
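
For reference, a minimal sketch of passing chunksize (not from the original answer; it assumes f and sentences as defined earlier in this answer):

import concurrent.futures

if __name__ == '__main__':
    with concurrent.futures.ProcessPoolExecutor() as executor:
        # chunksize only batches items for transport to the workers;
        # f is still called once per sentence
        out3 = list(executor.map(f, sentences, chunksize=500))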

One drawback with multiprocessing is that the data must be serialized (pickled) to be sent to the worker process, which takes time and might be significant for a compiled regular expression that large - it might defeat the gains from multiple processes.

I made a set of 13e6 random strings with 20 characters each to approximate your compiled regex.

import random, string

data = set(''.join(random.choice(string.printable) for _ in range(20)) for _ in range(13000000))

Pickling to an io.BytesIO stream takes about 7.5 seconds and unpickling from an io.BytesIO stream takes about 9 seconds. If you use a multiprocessing solution, it may be beneficial to pickle the concepts object (in whatever form) to the hard drive once, then have each process unpickle it from the hard drive rather than pickling/unpickling on each side of the IPC every time a new process is created - definitely worth testing, YMMV. The pickled set is 380 MB on my hard drive.
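A minimal sketch of that "pickle once to disk" idea (an assumption, not the answerer's code; the file name and helper names are made up), using ProcessPoolExecutor's initializer so each worker loads the set from disk once:

import pickle
import concurrent.futures

CONCEPTS_FILE = 'concepts.pickle'  # hypothetical path

def dump_concepts(concepts, path=CONCEPTS_FILE):
    # write the concepts set to disk once, before any workers start
    with open(path, 'wb') as fh:
        pickle.dump(set(concepts), fh, protocol=pickle.HIGHEST_PROTOCOL)

_concepts = None  # filled in separately inside each worker process

def init_worker(path=CONCEPTS_FILE):
    # runs once per worker: unpickle from disk instead of shipping the set over IPC
    global _concepts
    with open(path, 'rb') as fh:
        _concepts = pickle.load(fh)

def find_single_word_concepts(sentence):
    # simplified single-word check, just to keep the sketch short
    return [word for word in sentence.split() if word in _concepts]

if __name__ == '__main__':
    dump_concepts(['mining', 'patterns', 'statistics'])  # toy data
    with concurrent.futures.ProcessPoolExecutor(initializer=init_worker) as executor:
        print(list(executor.map(find_single_word_concepts,
                                ['data mining finds patterns'])))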

When I tried some experiments with concurrent.futures.ProcessPoolExecutor I kept blowing up my computer, because each process needed its own copy of the set and my computer just doesn't have enough RAM.

I'm going to post another answer dealing with the method of testing for concepts in sentences.

