
Currently I have a parser set up that parses CSV files of ~2 million records. I then apply some filtering algorithms to weed out the records I want to include/exclude, and finally write everything back out to a new CSV file.

I have done some benchmarking, and it turns out that writing data to the CSV is very expensive and causes massive slowdowns when filtering and appending to a file at the same time. I was wondering if I could perform all my filtering, placing the lines to be written in a queue, and then have a second process perform all the writing on its own once that queue is full or all filtering is complete.

So basically, to summarize:

  • Read a line
  • Decide whether to discard or keep it
  • If I'm keeping the line, add it to the "write queue"
  • Check if the write queue is full; if so, start the new process that will begin writing
  • Continue filtering until complete
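
Something like this is what I'm picturing (just a rough sketch; the sentinel object and the stand-in filtering loop below are made up):

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueuedWriter {
    // Sentinel object pushed onto the queue when filtering is finished.
    private static final String DONE = new String("EOF");

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> writeQueue = new LinkedBlockingQueue<>();

        // Writer thread drains the queue independently of the filtering.
        Thread writer = new Thread(() -> {
            try (BufferedWriter bw = new BufferedWriter(new FileWriter("myFile.csv"))) {
                String line;
                while ((line = writeQueue.take()) != DONE) {   // reference comparison on purpose
                    bw.write(line);
                    bw.newLine();
                }
            } catch (IOException | InterruptedException e) {
                e.printStackTrace();
            }
        });
        writer.start();

        // Stand-in for the real read/filter loop:
        for (String line : new String[] {"a,b,c", "", "d,e,f"}) {
            if (!line.isEmpty()) {            // keep/discard decision goes here
                writeQueue.put(line.trim());
            }
        }

        writeQueue.put(DONE);                 // signal the writer to finish
        writer.join();
    }
}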

Thanks for all your help!

EDIT: The way I'm writing is the following:

FileWriter fw = new FileWriter("myFile.csv");
BufferedWriter bw = new BufferedWriter(fw);
while (read file...) {
    // perform filters etc...
    try {
        bw.write(data.trim());
        bw.newLine();
    } catch (IOException e) {
        System.out.println(e.getMessage());
    }
}
bw.close();
  • Do you mind posting some code showing how you're writing out to the CSV file? Commented Jul 17, 2012 at 3:28
  • I take this exact approach with Python and read, write, and process in completely different threads. It's possible. Commented Jul 17, 2012 at 3:30
  • @Blender How much faster did your Python parser run? Commented Jul 17, 2012 at 3:47
  • You can spawn a new Thread and let that thread do all the writing work. Commented Jul 17, 2012 at 3:48
  • @1337holiday: five-six times? I'm running six threads. Commented Jul 17, 2012 at 3:48

2 Answers


The read and write processes are both I/O bound (seeking to sectors on disk and performing disk I/O to/from memory) while the filtering process is entirely CPU bound. This is a good candidate for multithreading.

I would use three threads: reading, filtering, and writing. This calls for two queues, but there's no reason to wait for the queues to become full before processing.

  • The reader thread reads from the file and appends rows to the incoming queue.
  • The filter thread takes rows from the incoming queue and writes those that pass the filter to the outgoing queue.
  • The writer thread takes rows from the outgoing queue and writes them to the new file.

Make sure to use buffered readers and writers to minimize contention between the reader and writer threads. You want to minimize disk seeking since that will be the bottleneck, assuming the filtering process is fairly simple.
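
For example, here's a minimal sketch of that three-thread pipeline using java.util.concurrent.BlockingQueue (the file names, queue capacities, and the keep() filter are placeholders for your own logic):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelineFilter {
    // Poison pill: a unique sentinel passed through the queues to signal end-of-stream.
    private static final String POISON = new String("EOF");

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> incoming = new ArrayBlockingQueue<>(10000);
        BlockingQueue<String> outgoing = new ArrayBlockingQueue<>(10000);

        Thread reader = new Thread(() -> {
            try (BufferedReader in = new BufferedReader(new FileReader("input.csv"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    incoming.put(line);            // blocks if the queue is full
                }
                incoming.put(POISON);              // tell the filter we're done
            } catch (IOException | InterruptedException e) {
                throw new RuntimeException(e);
            }
        });

        Thread filter = new Thread(() -> {
            try {
                String line;
                while ((line = incoming.take()) != POISON) {
                    if (keep(line)) {
                        outgoing.put(line.trim());
                    }
                }
                outgoing.put(POISON);              // propagate end-of-stream
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        });

        Thread writer = new Thread(() -> {
            try (BufferedWriter out = new BufferedWriter(new FileWriter("filtered.csv"))) {
                String line;
                while ((line = outgoing.take()) != POISON) {
                    out.write(line);
                    out.newLine();
                }
            } catch (IOException | InterruptedException e) {
                throw new RuntimeException(e);
            }
        });

        reader.start();
        filter.start();
        writer.start();
        reader.join();
        filter.join();
        writer.join();
    }

    // Placeholder: replace with the real include/exclude rules.
    private static boolean keep(String line) {
        return !line.isEmpty();
    }
}

The poison pill lets each stage shut down cleanly once the previous one finishes, and the bounded queues give you backpressure so the reader can't race arbitrarily far ahead of the writer.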


Comments

Perfect! I will most likely be implementing this.
+1, but it may make sense to have a mechanism by which you either read or write, but not at the same time. That way, you won't have two threads contending for the same disk head. A semaphore would be a good option here: the reader thread reads a bunch, then releases its hold so that the writer can take it, and the writer then releases it back to the reader. You'll need some heuristics so that a thread doesn't release the semaphore only to immediately re-acquire it.
@yshavit - Why not use the disk as the semaphore? It has low overhead and works just as well. Sure, you may be able to eke out slightly better performance if you read multiple blocks at a time, but only if the file is guaranteed to be stored in contiguous blocks. Also, if you are writing to a different disk from the one you're reading, this would serialize processes that could run concurrently. It's something to consider for sure, though.
@DavidHarkness - If the reading/writing is not done in reasonably large blocks, the inter-thread comms time will be higher than the parsing time :( A 64k block, maybe. There needs to be a little code to handle the one record that usually spans the end of a block - parse back from the end of the buffer to find the start of the last record and append it to the start of the next buffer before reading the next block from disk.
@MartinJames - Thus the desire for buffered I/O. Without a detailed description of the filtering process, we can only postulate about the various optimizations for the read/write threads.

You may want to consider using Spring Batch unless you have any constraints on using Spring.
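
If you go that route, a chunk-oriented job maps neatly onto this problem: a FlatFileItemReader, an ItemProcessor that returns null for rows to drop (returning null is how Spring Batch filters an item out), and a FlatFileItemWriter. Here's a rough sketch in Java-config style (file paths, bean names, chunk size, and the filter rule are placeholders):

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.FlatFileItemWriter;
import org.springframework.batch.item.file.mapping.PassThroughLineMapper;
import org.springframework.batch.item.file.transform.PassThroughLineAggregator;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.FileSystemResource;

@Configuration
@EnableBatchProcessing
public class CsvFilterJobConfig {

    @Bean
    public FlatFileItemReader<String> reader() {
        FlatFileItemReader<String> reader = new FlatFileItemReader<>();
        reader.setName("csvReader");
        reader.setResource(new FileSystemResource("input.csv"));
        reader.setLineMapper(new PassThroughLineMapper());
        return reader;
    }

    @Bean
    public ItemProcessor<String, String> filter() {
        // Returning null tells Spring Batch to drop the item.
        return line -> line.isEmpty() ? null : line.trim();
    }

    @Bean
    public FlatFileItemWriter<String> writer() {
        FlatFileItemWriter<String> writer = new FlatFileItemWriter<>();
        writer.setName("csvWriter");
        writer.setResource(new FileSystemResource("filtered.csv"));
        writer.setLineAggregator(new PassThroughLineAggregator<>());
        return writer;
    }

    @Bean
    public Step filterStep(StepBuilderFactory steps) {
        return steps.get("filterStep")
                .<String, String>chunk(1000)    // read/filter/write 1000 rows at a time
                .reader(reader())
                .processor(filter())
                .writer(writer())
                .build();
    }

    @Bean
    public Job filterJob(JobBuilderFactory jobs, Step filterStep) {
        return jobs.get("filterJob").start(filterStep).build();
    }
}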

