
I would like to understand if there is any way to use multithreading in a for loop. I have a big txt file (35GB); the script needs to split and strip each line and print the result to another txt file. The problem is that it's pretty slow and I would like to make it faster. I thought about using a lock, but I'm still not sure if it would work. Does anyone have any ideas? Thanks :D

5 Comments

  • Your throughput is going to be limited by disk speed, so multithreading will likely slow you down (probably by 15-20% or so -- I'm guessing..)
  • There are faster tools for this type of work though (e.g. sed)...
  • @thebjorn yes, unless you're performing heavy CPU processing in the middle, multithreaded read/write is the best way to knit a nice sweater out of the HDD arms :)
  • multiprocessing is a very effective way to leverage more computing resources, but I'd bet this is still HDD limited
  • @Aaron indeed. You can't finish faster than the disk can write, and a single CPU can provide data much (much! - seriously MUCH!) faster than a disk can write it ;-)

2 Answers


TL;DR the comments:

You are almost guaranteed to be limited by the read speed of your hard drive if the computation you are doing on each line is relatively light. Do some real profiling of your code to find where the slowdown actually is. If the data you are writing to file is much smaller than your 35GB input (i.e. it would all fit in RAM), you might find a speedup by writing it out only after the read is complete, so the drive can work entirely sequentially (also maybe not); a sketch of that buffered approach follows the profiling example below.

Example of profiling while converting a text file to CSV:

from cProfile import Profile

def main(debug=False):
    maxdata = 1000000  # read at most (roughly) `maxdata` bytes from the file if debug == True
    with open('bigfile.txt', 'r') as fin:
        with open('outfile.csv', 'w') as fout:
            for line in fin:
                fout.write(','.join(line.split()) + '\n')  # split on whitespace to convert to csv
                if debug and fin.tell() >= maxdata:  # stop early when profiling a sample
                    break

profiler = Profile()  # cProfile.Profile must be instantiated; enable/disable are instance methods
profiler.enable()
main(debug=True)
profiler.disable()
profiler.print_stats()
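
For completeness, here is a minimal sketch of the "write only after the read is complete" idea mentioned above. It assumes the converted output really does fit in RAM, and the file names are just placeholders:

def convert_buffered(in_path='bigfile.txt', out_path='outfile.csv'):
    rows = []
    with open(in_path, 'r') as fin:
        for line in fin:                      # read phase: the drive only reads, sequentially
            rows.append(','.join(line.split()))
    with open(out_path, 'w') as fout:
        fout.write('\n'.join(rows) + '\n')    # write phase: the drive only writes, sequentially

Whether this actually beats the interleaved loop depends on your drive and OS caching, which is exactly what the profiler above should tell you.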

3 Comments

Yes, but if I wait for the read to be completed, won't it fill up my RAM? The for-line loop is the least RAM-consuming way, I think. Anyway, I thought that multiprocessing could have increased the speed; it seems like it won't, as the comments say
If you can't wait for the read to complete, that's fine, it was just a suggestion; in reality, buffering will make the alternating reads and writes not so bad. I've written up a short example of how to profile a simple file read/write scheme.
Thanks, I appreciate it. I would like to accept both answers because you were both really kind in explaining to me how to do it, thanks :D

On SSDs and HDDs:

As others have pointed out, your main constraint here is going to be your hard drive. If you're using an HDD and not an SSD, you're actually going to see a decrease in performance by having multiple threads read from the disk at the same time, since they will end up reading randomly distributed blocks of data from the disk rather than reading sequentially.

If you look at how a hard drive works, it has a head that must seek (scan) to find the location of the data you're attempting to read. If you have multiple threads, they will still be limited by the fact that the hard drive can only read one block at a time. Hard drives perform well when reading/writing sequentially but do not perform well when reading/writing from random locations on the disk.

On the other hand, if you look at how a solid state drive works, it is the opposite. A solid state drive does better at reading from random places in storage. SSDs have no mechanical seek latency, which makes them great at reading from multiple places on disk.

The optimal structure of your program will look different depending on whether you're using an HDD or an SSD.


Optimal HDD Solution:

Assuming you're using an HDD for storage, your optimal solution looks something like this (a rough code sketch follows the list):

  1. Read a large chunk of data into memory from the main thread. Be sure you read in increments of your block size, which will increase performance.

    • If your HDD stores data in blocks of 4kB (or 4096 bytes), you should read in multiples of 4096. Most modern disk sectors (another term for blocks) are 4kB; older disks have 512-byte sectors. You can find out how big your blocks/sectors are by using lsblk or fdisk on Linux.
    • You will need to play around with different multiples of your block size, steadily increasing the amount of data you're reading, to see what size gives the best performance. If you read too much data in at once your program will be inefficient (because of read speeds). If you don't read enough data in at once, your program will also be inefficient (because of too many reads).
    • I'd start with 10 times your block size, then 20 times your block size, then 30 times your block size, until you find the optimal size of data to read in at once.
  2. Once your main thread has read from disk, you can spawn multiple threads to process the data.

    • Since Python has a GIL (global interpreter lock) for thread safety, you may want to use multiprocessing instead. The multiprocessing library's interface is very similar to the threading library's.
  3. While the child threads/processes are processing the data, have the main thread read in another chunk of data from the disk. Wait until the children have finished to spawn more for processing, and keep repeating this process.
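
Putting those steps together, here is a rough sketch of the read-then-process loop. It assumes a line-oriented text file, uses multiprocessing.Pool for the worker processes, and the chunk size (a multiple of the 4kB block size) and the per-line conversion are placeholders you would tune and replace:

import multiprocessing as mp

CHUNK_SIZE = 4096 * 1024  # some multiple of the 4kB block size -- tune this

def convert_line(line):
    # placeholder for the real per-line work: strip and split into csv
    return ','.join(line.split())

def main():
    with mp.Pool() as pool, \
         open('bigfile.txt', 'r') as fin, \
         open('outfile.csv', 'w') as fout:
        chunk = fin.readlines(CHUNK_SIZE)  # read roughly CHUNK_SIZE bytes of whole lines
        while chunk:
            pending = pool.map_async(convert_line, chunk)  # hand the current chunk to the workers...
            chunk = fin.readlines(CHUNK_SIZE)              # ...while the main process reads the next chunk
            fout.write('\n'.join(pending.get()) + '\n')    # collect the results and write them out

if __name__ == '__main__':
    main()

Note that the main process still does all of the reading and writing, so as the comments on the question point out, this only pays off if the per-line work is genuinely CPU-heavy.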

1 Comment

Thanks for the detailed answer :D
