
I have the following code:

import java.io.*;
import java.util.concurrent.*;

public class Example {
    public static void main(String[] args) {
        try {
            // Create two sample files.
            FileOutputStream fos = new FileOutputStream("1.dat");
            DataOutputStream dos = new DataOutputStream(fos);
            for (int i = 0; i < 200000; i++) {
                dos.writeInt(i);
            }
            dos.close();

            FileOutputStream fos1 = new FileOutputStream("2.dat");
            DataOutputStream dos1 = new DataOutputStream(fos1);
            for (int i = 200000; i < 400000; i++) {
                dos1.writeInt(i);
            }
            dos1.close();

            Exampless.createArray(200000); // create a shared array
            Exampless ex1 = new Exampless("1.dat");
            Exampless ex2 = new Exampless("2.dat");
            // Executed in parallel to count the number of matches in the two files.
            ExecutorService executor = Executors.newFixedThreadPool(2);
            long startTime = System.nanoTime();
            long endTime;
            Future<Integer> future1 = executor.submit(ex1);
            Future<Integer> future2 = executor.submit(ex2);
            int count1 = future1.get();
            int count2 = future2.get();
            endTime = System.nanoTime();
            long duration = endTime - startTime;
            System.out.println("duration with threads: " + duration);
            executor.shutdown();
            System.out.println("Matches: " + (count1 + count2));

            startTime = System.nanoTime();
            ex1.call();
            ex2.call();
            endTime = System.nanoTime();
            duration = endTime - startTime;
            System.out.println("duration without threads: " + duration);

        } catch (Exception e) {
            System.err.println("Error: " + e.getMessage());
        }
    }
}

class Exampless implements Callable<Integer> {

    public static int[] arr = new int[20000];
    public String _name;

    public Exampless(String name) {
        this._name = name;
    }

    static void createArray(int z) {
        for (int i = z; i < z + 20000; i++) { // fill the shared array
            arr[i - z] = i;
        }
    }

    public Integer call() {
        try {
            int cnt = 0;
            FileInputStream fin = new FileInputStream(_name);
            DataInputStream din = new DataInputStream(fin);
            // Read the file and count how many values match the shared array.
            for (int i = 0; i < 20000; i++) {
                int c = din.readInt();
                if (c == arr[i]) {
                    cnt++;
                }
            }
            din.close();
            return cnt;
        } catch (Exception e) {
            System.err.println("Error: " + e.getMessage());
        }
        return -1;
    }
}

Here I am trying to count the number of matches between an array and two files. Although I am running it on two threads, the code does not perform well, because:

(single thread, file 1 + file 2 reading time) < (file 1 || file 2 reading time with multiple threads).

Can anyone help me solve this? (I have a 2-core CPU and the file size is approximately 1.5 GB.)

  • @SurajChandran, most of the time. And truly no effect. :) Just run a test. (Jul 31, 2012)
  • The files aren't 1.5 GB, they are only ~80 KB. (Jul 31, 2012)
  • @KeithRandall, I just gave a sample usage. (Jul 31, 2012)
  • @Arpssss, I have added timing to the code listing, hope you don't mind. On my machine, the threaded version always runs faster than the sequential one, although the difference is not much: 48297747 nanoseconds with threads versus 78930159 nanoseconds without. (Jul 31, 2012)
  • @bpgergo, please increase the file size by increasing the for loop arguments. I just gave an example. (Jul 31, 2012)

2 Answers


In the first case you are reading one file sequentially, byte by byte, block by block. This is as fast as disk I/O can be, provided the file is not heavily fragmented. When you are done with the first file, the disk/OS finds the beginning of the second file and continues the efficient, linear read.

In the second case you are constantly switching between the first and the second file, forcing the disk to seek from one place to another. This extra seek time (approximately 10 ms per seek) is the root of your confusion.

Also, keep in mind that disk access is essentially single-threaded and your task is I/O-bound, so there is no way splitting it across multiple threads can help, as long as you are reading from the same physical disk. Your approach could only be justified if:

  • each thread, besides reading from a file, also performed some CPU-intensive or blocking operations that were slower than the I/O by an order of magnitude;

  • the files were on different physical drives (a different partition is not enough) or on certain RAID configurations;

  • you were using an SSD drive.



+1. This is a fundamental problem that many people don't understand: only increasing the limiting reagent is going to increase performance.

As Tomasz pointed out, you will not get any benefit from multithreading the reads from disk. You may get some speed-up if you multithread the checks, i.e. load the data from the files into arrays sequentially and then have the threads do the comparisons in parallel. But considering the small size of your files (~80 KB) and the fact that you are just comparing ints, I doubt the performance improvement would be worth the effort.
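A minimal sketch of that idea: load each file into an array sequentially, then split the comparison work between two threads. The class and method names here (`ParallelCompare`, `readAll`, `countMatches`) are illustrative, not from the original code.

```java
import java.io.*;
import java.nio.ByteBuffer;
import java.util.concurrent.*;

public class ParallelCompare {

    // Sequentially read 'count' ints from a file into an array (one bulk read).
    static int[] readAll(String name, int count) throws IOException {
        byte[] raw = new byte[count * 4];
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(name)))) {
            in.readFully(raw);
        }
        int[] result = new int[count];
        ByteBuffer.wrap(raw).asIntBuffer().get(result); // big-endian, same as writeInt()
        return result;
    }

    // Count positions where data[i] == expected[i], splitting the range across two threads.
    static int countMatches(int[] data, int[] expected) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            int mid = data.length / 2;
            Future<Integer> lower = pool.submit(() -> matchesInRange(data, expected, 0, mid));
            Future<Integer> upper = pool.submit(() -> matchesInRange(data, expected, mid, data.length));
            return lower.get() + upper.get();
        } finally {
            pool.shutdown();
        }
    }

    private static int matchesInRange(int[] data, int[] expected, int from, int to) {
        int cnt = 0;
        for (int i = from; i < to; i++) {
            if (data[i] == expected[i]) cnt++;
        }
        return cnt;
    }
}
```

This keeps the disk access strictly sequential (no seeking between files) and parallelizes only the CPU-side comparison.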

Something that will definitely improve your execution speed is to avoid readInt(). Since you know you are comparing 20000 ints, read all 20000 ints into an array at once for each file (or at least in blocks), rather than calling readInt() 20000 times.
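For example, a single readFully() call can replace the 20000 readInt() calls; a sketch assuming the file layout from the question (20000 big-endian ints written with writeInt()), with `BlockRead` as a hypothetical class name:

```java
import java.io.*;
import java.nio.ByteBuffer;

public class BlockRead {
    static final int N = 20000; // number of ints per file, as in the question

    // Read all N ints in one bulk call, then compare against the shared array.
    static int countMatches(String name, int[] arr) throws IOException {
        byte[] raw = new byte[N * 4];
        try (DataInputStream din = new DataInputStream(new FileInputStream(name))) {
            din.readFully(raw); // one bulk read instead of N readInt() calls
        }
        ByteBuffer buf = ByteBuffer.wrap(raw); // big-endian, same byte order as writeInt()
        int cnt = 0;
        for (int i = 0; i < N; i++) {
            if (buf.getInt() == arr[i]) cnt++;
        }
        return cnt;
    }
}
```

Each readInt() call goes through the stream's per-call overhead; reading the whole block at once pays that cost a single time.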

