
My Python code takes a large file as input and produces a corresponding output file. However, it takes too long, and I want to speed it up.

Right now, I have split the large file into 1000 smaller files. I want a small script that will launch 1000 threads, where each thread runs my original Python code on one piece and writes its own output file.

Can anyone give me sample/example code?

  • That won't speed it up (much, if at all). You should only split it into as many parts as there are cores available, and use the multiprocessing library. The only reason to use threads in Python is when you have a GUI you don't want to block; otherwise you should use multiprocessing if you need concurrent data processing. Commented Sep 4, 2014 at 17:40
  • Is your work actually dominated by CPU (processing), or by I/O (reading and writing the files)? You need to profile to figure that out first, before you decide how to parallelize things. Commented Sep 4, 2014 at 17:41
  • It is dominated by I/O; each line costs 4ms of CPU, so I assume the I/O cost is higher. Commented Sep 4, 2014 at 17:44
  • 4ms is actually a pretty good amount of CPU to be spending on a single line of input; the I/O costs per line should amortize to much less than that, unless you're reading and writing to a network share drive or something. Commented Sep 4, 2014 at 17:55

3 Answers


First, using 1000 threads will almost certainly slow things down, not speed it up. Even if your code is completely I/O bound, 1000 is pushing the limits of many platforms' schedulers, and you'll spend more time context switching than doing actual work.

Next, you need to know whether your code is CPU-bound (that is, doing actual processing on information in memory) or I/O-bound (that is, waiting on things like disk reads and writes).


If your code is CPU-bound, and you can keep the CPU busy pretty consistently, you want exactly 1 thread per core. That way, you get the maximum amount of parallelism with the minimum amount of context switching (and cache thrashing, assuming most of the work is done on either immutable or non-shared values).

Also (unless that work is being done in specially-designed C extensions like numpy), you want these threads to be in separate processes, because only 1 thread per process can run the Python interpreter at a time, thanks to the Global Interpreter Lock.

So, what you want is almost certainly a process pool. The easiest way to do that is to use the concurrent.futures.ProcessPoolExecutor, possibly with a max_workers argument (maybe start with 16, then try tweaking it up and down to see if it helps).


If, on the other hand, your code is mostly I/O-bound, then a couple dozen threads is reasonable, especially if the delays are unpredictable, but not 1000. And threads in the same process will work fine, because one thread can run the Python interpreter while the others are all waiting for the OS to finish a disk operation.

So, in this case, you want a concurrent.futures.ThreadPoolExecutor.
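A minimal sketch of that, again with hypothetical file names and a stand-in worker that just copies each input file to its output:

```python
import concurrent.futures
import os
import tempfile

def copy_file(pair):
    # Hypothetical I/O-bound worker: streams one input file to its output.
    # While one thread waits on the disk, the others keep running.
    src_name, dst_name = pair
    with open(src_name) as src, open(dst_name, 'w') as dst:
        for line in src:
            dst.write(line)
    return dst_name

# Stand-in input files; substitute your real chunks here.
workdir = tempfile.mkdtemp()
pairs = []
for i in range(3):
    src = os.path.join(workdir, 'in_%d' % i)
    with open(src, 'w') as f:
        f.write('payload %d\n' % i)
    pairs.append((src, src + '.out'))

# A couple dozen threads is plenty when the delays are I/O waits.
with concurrent.futures.ThreadPoolExecutor(max_workers=24) as pool:
    done = list(pool.map(copy_file, pairs))
```

The structure is deliberately identical to the process-pool version, which is what makes switching between the two trivial.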


If you're not sure, and don't know how to find out, build it with a thread pool first, then use Activity Monitor (or Task Manager on Windows, or your favorite of the 300 options on Linux) to watch it run; if you end up with one core at 100% and the others below 25%, then you're too CPU-bound to be using threads. Fortunately, switching to a process pool is a trivial change: replace ThreadPoolExecutor with ProcessPoolExecutor, remove the max_workers argument so Python will pick the best default, and you're done.


In either case, the examples in the docs are good enough that there's no reason to ask for other sample code.



  • If you don't have 1000 processors, splitting into 1000 pieces gains nothing; on the contrary, it adds big overhead...
  • Multithreading is for managing I/O blocking more efficiently, not for parallelizing processing work.
  • If your bottleneck is I/O on the same device, adding more workers will only increase the load on it and increase overhead (head seeks, cache thrashing...)

What you're searching for is multiprocessing: https://docs.python.org/2/library/multiprocessing.html



If you decide to go with multiprocessing instead, you would do it in a very similar way to the threaded version below. You can try something like this:

import queue
from threading import Thread

file_list = ['filea', 'fileb']

def do_stuff(q):
    while True:
        try:
            file_name = q.get(False)
        except queue.Empty:
            # Queue drained: nothing left for this worker to do
            break
        # Do whatever per-file processing you need here
        print(file_name)
        q.task_done()

q = queue.Queue(maxsize=0)
num_threads = 2

# Load every file name before the workers start pulling from the queue
for x in file_list:
    q.put(x)

for i in range(num_threads):
    worker = Thread(target=do_stuff, args=(q,))
    worker.daemon = True
    worker.start()

# Block until every queued item has been marked done
q.join()

3 Comments

Why build a pool yourself when the multiprocessing library has one built-in (which also adds all kinds of features you haven't built, like returning values, properly signaling completion and waiting, etc.), and concurrent.futures (or the futures backport) has an even easier-to-use executor?
@abarnert Agreed, but it's just an example, to show the idea.
OK, but why build an example in a couple dozen lines doing things the hard way and leaving things out, when you could write an example in a few lines of code the easy way and cover everything?
