Multithreading / Multiprocessing in Python

Question

I have created a simple substring search program that recursively looks through a folder and scans a large number of files. The program uses the Boyer-Moore-Horspool algorithm and is very efficient at parsing large amounts of data.

Link to program: http://pastebin.com/KqEMMMCT

What I'm trying to do now is make it even more efficient. If you look at the code, you'll notice that there are three different directories being searched. I would like to be able to create a process/thread that searches each directory concurrently, it would greatly speed up my program.

What is the best way to implement this? I have done some preliminary research, but my implementations have been unsuccessful. They seem to die after 25 minutes or so of processing (right now the single process version takes nearly 24 hours to run; it's a lot of data, and there are 648 unique keywords.)

I have done various experiments using the multiprocessing API and condensing all the various files into 3 files (one for each directory) and then mapping the files to memory via mmap(), but a: I'm not sure if this is the appropriate route to go, and b: my program kept dying at random points, and debugging was an absolute nightmare.

Yes, I have done extensive googleing, but I'm getting pretty confused between pools/threads/subprocesses/multithreading/multiprocessing.

I'm not asking for you to write my program, just help me understand the thought process needed to go about implementing a solution. Thank you!

FYI: I plan to open-source the code once I get the program running. I think it's a fairly useful script, and there are limited examples of real world implementations of multiprocessing available online.

My suggestion is to keep the directory walking in the main thread and do the file processing in process (or thread) pool. — arthurprs
– arthurprs, Commented Apr 11, 2012 at 16:07
It'll definitely have to be multiprocessing, not threading, otherwise you won't gain anything because of the GIL. — g.d.d.c
– g.d.d.c, Commented Apr 11, 2012 at 16:08
What exactly happens when your program dies after 25 min? Does it freeze? Raise an exception? I don't suppose you could run out of memory, so it should be possible to identify the precise point where something goes wrong. — max
– max, Commented Apr 11, 2012 at 16:26
Your requirements are quite unclear. Do you want a full explanation of the multiprocessing lib? try reading docs.python.org/library/multiprocessing.html. It is well written and should contain everything needed. Off topic: you have things like noluck = "false" in your code. Do you know that boolean types True and False are available in python? — Simon Bergot
– Simon Bergot, Commented Apr 11, 2012 at 16:35
@Simon - my first jab at Python was last week, thanks for the heads up!<br/> — Danielle Neri
– Danielle Neri, Commented Apr 11, 2012 at 16:37

tom10 · Accepted Answer · 2012-04-11 17:22:16Z

5

What to do depends on what's slowing down the process.

If you're reading on a single disk, and disk I/O is slowing you down, multiple threads/process will probably just slow you down as the read head will now be jumping all over the place as different threads get control, and you'll be spending more time seeking than reading.

If you're reading on a single disk, and processing is slowing you down, then you might get a speedup from using multiprocessing to analyze the data, but you should still read from a single thread to avoid seek time delays (which are usually very long, multiple milliseconds).

If you're reading from multiple disks, and disk I/O is slowing you down, then either multiple threads or processes will probably give you a speed improvement. Threads are easier, and since most of your delay time is away from the processor, the GIL won't be in your way.

If you're reading from multiple disks,, and processing is slowing you down, then you'll need to go with multiprocessing.

edited Apr 11, 2012 at 17:22

answered Apr 11, 2012 at 16:51

tom10

69.5k11 gold badges133 silver badges143 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Danielle Neri Over a year ago

All files are on a single local disk. My goal is to condense everything into one file per directory in order to limit I/O bottlenecks. Most of the delay in in CPU time spent parsing the files.

tom10 Over a year ago

OK. I edited my answer, so I think my second entry applies to your situation.

Spencer Rathbun · Accepted Answer · 2012-04-11 17:49:19Z

1

Multiprocessing is easier to understand/use than multithreading(IMO). For my reasons, I suggest reading this section of TAOUP. Basically, everything a thread does, a process does, only the programmer has to do everything that the OS would handle. Sharing resources (memory/files/CPU cycles)? Learn locking/mutexes/semaphores and so on for threads. The OS does this for you if you use processes.

I would suggest building 4+ processes. 1 to pull data from the hard drive, and the other three to query it for their next piece. Perhaps a fifth process to stick it all together.

This naturally fits into generators. See the genfind example, along with the gengrep example that uses it.

Also on the same site, check out the coroutines section.

edited Apr 11, 2012 at 17:49

answered Apr 11, 2012 at 17:42

Spencer Rathbun

15k6 gold badges56 silver badges74 bronze badges

Collectives™ on Stack Overflow

Multithreading / Multiprocessing in Python

2 Answers 2

2 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related