Is there any way to use multithreading in a for loop? I have a big txt file (35 GB); the script needs to split and strip each line and print the result to another txt file. The problem is that it's pretty slow and I would like to make it faster. I thought about using a lock, but I'm still not sure whether it would work. Anyone have any ideas? Thanks :D
2 Answers
TL;DR the comments:
You are almost guaranteed to be limited by the read speed of your drive if the computation you do on each line is relatively light. Do some real profiling of your code to find where the slowdown actually is. If the data you are writing out is much smaller than your 35 GB input (i.e. it would all fit in RAM), you might see a speedup by writing it only after the read is complete, so the drive can work entirely sequentially (also maybe not).
An example of profiling a text-file-to-csv conversion:
from cProfile import Profile

def main(debug=False):
    maxdata = 1000000  # in debug mode, stop after roughly `maxdata` characters
    nread = 0          # fin.tell() can't be called while iterating a text file, so count manually
    with open('bigfile.txt', 'r') as fin:
        with open('outfile.csv', 'w') as fout:
            for line in fin:
                fout.write(','.join(line.split()) + '\n')  # split on whitespace to convert to csv
                nread += len(line)
                if debug and nread >= maxdata:
                    break

profiler = Profile()  # cProfile.Profile must be instantiated before enabling
profiler.enable()
main(debug=True)
profiler.disable()
profiler.print_stats()
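If profiling shows that writes are part of the problem and the converted data really would fit in RAM, a variation of main() along these lines (same hypothetical file names as above) defers all writing until the read is complete, so the drive only ever streams sequentially:
def main_buffered():
    rows = []
    with open('bigfile.txt', 'r') as fin:
        for line in fin:
            rows.append(','.join(line.split()))
    # one sequential write after the read has finished; only viable if the
    # converted data fits comfortably in RAM
    with open('outfile.csv', 'w') as fout:
        fout.write('\n'.join(rows))
Whether this actually helps depends on how your OS buffers writes, so measure both versions.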
On SSDs and HDDs:
As others have pointed out, your main constraint here is going to be your drive. If you're using an HDD and not an SSD, you're actually going to see a decrease in performance by having multiple threads read from the disk at the same time, because they will be requesting randomly distributed blocks instead of letting the drive read sequentially.
If you look at how a hard drive works, it has a head that must seek (physically move) to the location of the data you're attempting to read. With multiple threads you are still limited by the fact that the drive can only read one block at a time, and every jump between the threads' read positions costs a seek. Hard drives perform well when reading/writing sequentially but poorly when reading/writing from random locations on the disk.
A solid state drive, on the other hand, behaves in the opposite way: it does well at reading from random places in storage. SSDs have no mechanical seek latency, which makes them good at serving reads from multiple places on disk at once.
The optimal structure of your program will look different depending on whether or not you're using an HDD or an SSD.
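For illustration, a minimal sketch of the SSD-friendly pattern, range-partitioned concurrent reads, might look like this (the file name, worker count, and even splitting are assumptions; plain threads suffice here because file reads release the GIL):
import os
from concurrent.futures import ThreadPoolExecutor

FILENAME = 'bigfile.txt'  # hypothetical input file
NUM_WORKERS = 4

def read_range(args):
    offset, length = args
    # each thread opens its own handle and reads an independent byte range;
    # on an SSD these reads can proceed concurrently with little penalty,
    # while on an HDD each range boundary would cost a seek
    with open(FILENAME, 'rb') as f:
        f.seek(offset)
        return f.read(length)

size = os.path.getsize(FILENAME)
step = size // NUM_WORKERS
ranges = [(i * step, step if i < NUM_WORKERS - 1 else size - i * step)
          for i in range(NUM_WORKERS)]

# note: byte ranges will usually split a line in the middle; real code
# has to stitch the boundaries back together before processing by line
with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    parts = list(pool.map(read_range, ranges))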
Optimal HDD Solution:
Assuming you're using an HDD for storage, your optimal solution looks something like this:
Read a large chunk of data into memory from the main thread. Be sure you read in increments of your block size, which will increase performance.
- If your HDD stores data in blocks of 4 kB (4096 bytes), you should read in multiples of 4096. Most modern disk sectors (another term for blocks) are 4 kB; older legacy disks have 512-byte sectors. You can find out how big your blocks/sectors are with lsblk or fdisk on Linux.
- You will need to experiment with different multiples of your block size, steadily increasing the amount of data you read, to see what size gives the best performance. Reading too much data at once is inefficient (you stall on long reads); reading too little is also inefficient (you issue too many reads).
- I'd start with 10 times your block size, then 20 times, then 30 times, until you find the optimal amount of data to read at once. A sketch of this chunked read follows these bullets.
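A minimal sketch of the chunked read (the file name is a placeholder; os.stat's st_blksize field is one way to ask the OS for its preferred I/O size):
import os

FILENAME = 'bigfile.txt'  # placeholder

block = os.stat(FILENAME).st_blksize  # preferred I/O size, typically 4096
chunk_size = 20 * block               # tune this multiple as described above

leftover = b''
with open(FILENAME, 'rb') as f:
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        lines = (leftover + chunk).split(b'\n')
        leftover = lines.pop()  # a chunk usually ends mid-line; carry the tail
        # `lines` now holds only complete lines, ready to hand to workers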
Once your main thread has read from disk, you can spawn multiple threads to process the data.
- Since Python has a GIL (global interpreter lock) that prevents threads from running Python bytecode in parallel, you may want to use multiprocessing instead. The multiprocessing library's interface is very similar to the threading library's.
While the child threads/processes are processing the data, have the main thread read the next chunk of data from the disk. Wait until the children have finished before spawning more, and repeat until the whole file has been processed.
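Putting the steps together, a rough sketch using multiprocessing.Pool might look like this (file names, chunk size, and the per-line transform are placeholders to tune or replace). pool.imap consumes the chunk generator lazily, so the main process reads the next chunk from disk while the workers are still processing earlier ones:
import multiprocessing as mp

def process_lines(lines):
    # placeholder transform: split each line and re-join as csv
    return [b','.join(line.split()) for line in lines]

def read_chunks(path, chunk_size=4 * 1024 * 1024):
    leftover = b''
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            lines = (leftover + chunk).split(b'\n')
            leftover = lines.pop()  # carry the partial last line over
            yield lines
    if leftover:
        yield [leftover]

if __name__ == '__main__':
    with mp.Pool() as pool, open('outfile.csv', 'wb') as fout:
        # imap keeps results in input order, and the main process keeps
        # reading sequentially while workers handle earlier chunks
        for result in pool.imap(process_lines, read_chunks('bigfile.txt')):
            if result:
                fout.write(b'\n'.join(result) + b'\n')
On an HDD you'd keep the pool small; the point is to overlap one sequential reader with CPU-bound workers, not to multiply disk readers.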