
I would like to understand if there is any way to use multithreading in a for loop. I have a big txt file (35GB); the script needs to split and strip each line and print the result to another txt file. The problem is that it's pretty slow and I would like to make it faster. I thought about using a lock, but I'm still not sure if it would work. Does anyone have any ideas? Thanks :D

5 Comments

  • Your throughput is going to be limited by disk speed, so multithreading will likely slow you down (probably by 15-20% or so -- I'm guessing..)
  • There are faster tools for this type of work though (e.g. sed)...
  • @thebjorn yes, unless you're performing heavy CPU processing in the middle, multithreaded read/write is the best way to knit a nice sweater out of the HDD arms :)
  • multiprocessing is a very effective way to leverage more computing resources, but I'd bet this is still HDD limited
  • @Aaron indeed. You can't finish faster than the disk can write, and a single CPU can provide data much (much! - seriously MUCH!) faster than a disk can write it ;-)

2 Answers


TL;DR the comments:

You are almost guaranteed to be limited by the read speed of your hard drive if the computation you are doing on each line is relatively light. Do some real profiling of your code to find where the slowdown actually is. If the data you are writing to file is much smaller than your 35GB input (i.e. it would all fit in RAM), you might find a speedup by writing it out only after the read is complete, so the drive can work entirely sequentially (also maybe not); a sketch of that buffered approach follows the profiling example below.

Example of profiling while converting a text file to CSV:

from cProfile import Profile

def main(debug=False):
    maxdata = 1000000  # read at most (roughly) `maxdata` bytes from the file if debug == True
    with open('bigfile.txt', 'r') as fin:
        with open('outfile.csv', 'w') as fout:
            for line in fin:
                fout.write(','.join(line.split()) + '\n')  # split on whitespace to convert to csv
                if debug and fin.tell() >= maxdata:  # stop early when profiling a sample
                    break

profiler = Profile()  # cProfile.Profile must be instantiated; enable/disable are instance methods
profiler.enable()
main(debug=True)
profiler.disable()
profiler.print_stats()
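
For completeness, here is a minimal sketch of the "write only after the read is complete" idea mentioned above. It assumes the converted output really does fit in RAM, and the file names are just placeholders:

def convert_buffered(in_path='bigfile.txt', out_path='outfile.csv'):
    rows = []
    with open(in_path, 'r') as fin:
        for line in fin:                      # read phase: the drive only reads, sequentially
            rows.append(','.join(line.split()))
    with open(out_path, 'w') as fout:
        fout.write('\n'.join(rows) + '\n')    # write phase: the drive only writes, sequentially

Whether this actually beats the interleaved loop depends on your drive and OS caching, which is exactly what the profiler above should tell you.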

3 Comments

Yes, but if I wait for the read to be completed, won't it fill up my RAM? The for-line loop is the least RAM-consuming way, I think. Anyway, I thought that multiprocessing could have increased the speed; it seems like it won't, as the comments say
If you can't wait for the read to complete, that's fine, it was just a suggestion; in reality, buffering will make the alternating reads and writes not so bad. I've written up a short example of how to profile a simple file read/write scheme.
Thanks, I appreciate it. I would like to accept both answers because you were both really kind in explaining to me how to do it, thanks :D

On SSDs and HDDs:

As others have pointed out, your main constraint here is going to be your hard drive. If you're using an HDD and not an SSD, you're actually going to see a decrease in performance by having multiple threads read from the disk at the same time, since they will end up reading randomly distributed blocks of data from the disk rather than reading sequentially.

If you look at how a hard drive works, it has a head that must seek (scan) to find the location of the data you're attempting to read. If you have multiple threads, they will still be limited by the fact that the hard drive can only read one block at a time. Hard drives perform well when reading/writing sequentially but do not perform well when reading/writing from random locations on the disk.

On the other hand, if you look at how a solid state drive works, it is the opposite. A solid state drive does better at reading from random places in storage. SSDs have no mechanical seek latency, which makes them great at reading from multiple places on disk.

The optimal structure of your program will look different depending on whether you're using an HDD or an SSD.


Optimal HDD Solution:

Assuming you're using an HDD for storage, your optimal solution looks something like this (a rough code sketch follows the list):

  1. Read a large chunk of data into memory from the main thread. Be sure you read in increments of your block size, which will increase performance.

    • If your HDD stores data in blocks of 4kB (or 4096 bytes), you should read in multiples of 4096. Most modern disk sectors (another term for blocks) are 4kB; older disks have 512-byte sectors. You can find out how big your blocks/sectors are by using lsblk or fdisk on Linux.
    • You will need to play around with different multiples of your block size, steadily increasing the amount of data you're reading, to see what size gives the best performance. If you read too much data in at once your program will be inefficient (because of read speeds). If you don't read enough data in at once, your program will also be inefficient (because of too many reads).
    • I'd start with 10 times your block size, then 20 times your block size, then 30 times your block size, until you find the optimal size of data to read in at once.
  2. Once your main thread has read from disk, you can spawn multiple threads to process the data.

    • Since Python has a GIL (global interpreter lock) for thread safety, you may want to use multiprocessing instead. The multiprocessing library's interface is very similar to the threading library's.
  3. While the child threads/processes are processing the data, have the main thread read in another chunk of data from the disk. Wait until the children have finished to spawn more for processing, and keep repeating this process.
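
Putting those steps together, here is a rough sketch of the read-then-process loop. It assumes a line-oriented text file, uses multiprocessing.Pool for the worker processes, and the chunk size (a multiple of the 4kB block size) and the per-line conversion are placeholders you would tune and replace:

import multiprocessing as mp

CHUNK_SIZE = 4096 * 1024  # some multiple of the 4kB block size -- tune this

def convert_line(line):
    # placeholder for the real per-line work: strip and split into csv
    return ','.join(line.split())

def main():
    with mp.Pool() as pool, \
         open('bigfile.txt', 'r') as fin, \
         open('outfile.csv', 'w') as fout:
        chunk = fin.readlines(CHUNK_SIZE)  # read roughly CHUNK_SIZE bytes of whole lines
        while chunk:
            pending = pool.map_async(convert_line, chunk)  # hand the current chunk to the workers...
            chunk = fin.readlines(CHUNK_SIZE)              # ...while the main process reads the next chunk
            fout.write('\n'.join(pending.get()) + '\n')    # collect the results and write them out

if __name__ == '__main__':
    main()

Note that the main process still does all of the reading and writing, so as the comments on the question point out, this only pays off if the per-line work is genuinely CPU-heavy.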

1 Comment

Thanks for the detailed answer :D
