
I'm trying to write a simple function that reads a series of files, performs a regex search (or just a word count) on each of them, and returns the number of matches. I'm trying to make this run in parallel to speed it up, but so far I have been unable to achieve this.

If I parallelize a simple loop doing a math operation, I do get significant performance increases. However, the same idea applied to the grep-like function gives no speed increase:

function open_count(file)
    # read the whole file into memory and count whitespace-separated words
    fh = open(file)
    text = readall(fh)
    length(split(text))
end



# serial loop
tic()
total = 0
for name in files
    total += open_count(string(dir, "/", name))
end
toc()
elapsed time: 29.474181026 seconds


# @parallel version
tic()
total = @parallel (+) for name in files
    open_count(string(dir, "/", name))
end
toc()

elapsed time: 29.086511895 seconds

I have tried different versions, but none of them gave a significant speed increase. Am I doing something wrong?

  • 27 seconds to process a file? I'd guess these are fairly big disk files; they won't fit in your processor's disk cache and have to be read from the disk each time. Then the best you can hope for is a time equal to the time to read both files from the disk. Typically the disk can only read one place at a time --> disk reads are sequential and thus no speedup. Commented Jan 23, 2014 at 8:01
  • It's not one single file; it's a list of files (almost a GB in total, I think). I should have said that. But thanks for that explanation. Commented Jan 23, 2014 at 8:12
  • I cannot test this, because I do not have files of this size to test on. Could you publish a script that generates something with the same structure and size? Your OS is probably taking up most of the time here. Have you considered closing the files in open_count()? Commented Jan 23, 2014 at 9:25
  • Have you profiled? Doing so will tell you whether the bottleneck is in the I/O or the regex (a minimal sketch follows these comments). If it's the former, consider spreading your files across multiple drives. Commented Jan 23, 2014 at 11:25
  • @ivarne Closing the files did slightly improve performance; that will be helpful. With this script you can get a similar-looking corpus. Commented Jan 23, 2014 at 16:41
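
As the last comment suggests, profiling is the quickest way to see whether the time goes into the disk reads or into the splitting/matching. Below is a minimal sketch of that step, reusing dir, files and open_count from the question; on the 0.x Julia this question dates from the profiler is part of Base, while on current Julia it ships as the Profile standard library.

# using Profile   # required on Julia 0.7 and later; built into Base on the 0.x versions used here
@profile for name in files
    open_count(string(dir, "/", name))
end
Profile.print()   # the report shows whether the time sits in the read or in split()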

1 Answer


I've had similar problems with R and Python. As others pointed out in the comments, you should start with the profiler.

If the read is taking up the majority of the time, there's not much you can do. You can try spreading the files across different hard drives and reading them in from there. You can also try a RAM-disk kind of solution, which basically makes part of your RAM look like permanent storage (reducing the available RAM), but then you get very fast reads and writes.

However, if the time is spent doing the regex, then consider the following: create a function that reads one file in as a whole and splits it into separate lines. That should be one continuous read and hence as fast as possible. Then create a parallel version of your regex that processes the lines in parallel. This way the whole file is in memory and your computing cores can munge the data at a faster rate, so you might see some increase in performance.
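
A minimal sketch of that idea, reusing the @parallel (+) reduction from the question. The function name count_matches_parallel and the pattern argument are placeholders; real parallelism needs workers started with addprocs() first, and on later Julia versions @parallel was renamed @distributed (in the Distributed standard library) and readall became read(f, String).

function count_matches_parallel(file, pattern)
    # one continuous, sequential read of the whole file (open closes the handle afterwards)
    text  = open(readall, file)
    lines = split(text, '\n')
    # distribute the per-line matching across the workers and sum the counts
    @parallel (+) for i in 1:length(lines)
        length(collect(eachmatch(pattern, lines[i])))
    end
end

It would be called in place of open_count, e.g. count_matches_parallel(string(dir, "/", name), r"\bsomeword\b").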

This is a technique I used when trying to process large text files.
