
We process huge files (sometimes 50 GB per file). The application reads a file and, based on the business logic, writes multiple output files (4-6).

The records in the file are of variable length, and the fields in a record are delimiter-separated.

We went by the understanding that reading a file using FileChannel with a ByteBuffer is always better than using BufferedReader.readLine() followed by a split on the delimiter.

  • Buffer sizes tried: 10240 (10 KB) and larger
  • Commit intervals tried: 5000, 10000, etc.

Below is how we used FileChannel to read:

  • Read byte by byte. Check whether the byte read is the newline char (10), which means end of line.
  • Check for delimiter bytes. Capture the bytes read into a byte array (we initialized this byte array with a maximum field size of 350 bytes) until the delimiter bytes are encountered.
  • Convert the bytes read so far to a String using UTF-8 encoding: new String(byteArr, 0, index, "UTF-8"), where index is the number of bytes read up to the delimiter.

Using this method of reading the file with FileChannel, processing took 57 minutes.

We wanted to decrease this time, so we tried BufferedReader.readLine() followed by a split on the delimiter to see how it fares.

And shockingly, the same file finished processing in only 7 minutes.

What's the catch here? Why is FileChannel taking more time than a BufferedReader followed by a String split?

I was always under the assumption that the readLine-and-split combination would have a big performance impact.

Can anyone throw light on whether I was using FileChannel in the wrong way?

Thanks in advance. I hope I have summarized the issue properly.

Below is sample code:

while (inputByteBuffer.hasRemaining() && (b = inputByteBuffer.get()) != 0){
        boolean endOfField = false;
        if (b == 10){
            break;
        }
        else{
            if (b == 94){//^
                if (!inputByteBuffer.hasRemaining()){
                    inputByteBuffer.clear();
                    noOfBytes = inputFileChannel.read(inputByteBuffer);
                    inputByteBuffer.flip();
                }
                if (inputByteBuffer.hasRemaining()){
                    byte b2 = inputByteBuffer.get();
                    if (b2 == 124){//|
                        if (!inputByteBuffer.hasRemaining()){
                            inputByteBuffer.clear();
                            noOfBytes = inputFileChannel.read(inputByteBuffer);
                            inputByteBuffer.flip();
                        }

                        if (inputByteBuffer.hasRemaining()){
                            byte b3 = inputByteBuffer.get();
                            if (b3 == 94){//^
                                String field = new String(fieldBytes, 0, index, encoding);
                                if(fieldIndex == -1){
                                    fields = new String[sizeFromAConfiguration];
                                }else{
                                    fields[fieldIndex] = field;
                                }

                                fieldBytes = new byte[maxFieldSize];
                                endOfField = true;
                                fieldIndex++;
                            }
                            else{
                                fieldBytes = addFieldBytes(fieldBytes, b, index);
                                index++;
                                fieldBytes = addFieldBytes(fieldBytes, b2, index);
                                index++;
                                fieldBytes = addFieldBytes(fieldBytes, b3, index);
                            }
                        }
                        else{
                            endOfFile = true;
                            //fields.add(new String(fieldBytes, 0, index, encoding));
                            fields[fieldIndex] = new String(fieldBytes, 0, index, encoding);
                            fieldBytes = new byte[maxFieldSize];
                            endOfField = true;
                        }
                    }else{
                        fieldBytes = addFieldBytes(fieldBytes, b, index);
                        index++;
                        fieldBytes = addFieldBytes(fieldBytes, b2, index);

                    }
                }else{
                    endOfFile = true;
                    fieldBytes = addFieldBytes(fieldBytes, b, index);
                }
            }
            else{
                fieldBytes = addFieldBytes(fieldBytes, b, index);
            }
        }

        if (!inputByteBuffer.hasRemaining()){
            inputByteBuffer.clear();
            noOfBytes = inputFileChannel.read(inputByteBuffer);
            inputByteBuffer.flip();
        }

        if (endOfField){
            index = 0;
        }
        else{
            index++;
        }

}
5 Comments

  • BufferedReader doesn't read byte-by-byte, and neither should you. You should pick a reasonably sized buffer (BufferedReader has an 8192-byte buffer). Yes, it will be more difficult to implement, but you won't be wasting CPU cycles reading a single byte at a time.
  • Reading any file byte by byte is the worst case. Anything will be an improvement over that. readLine() is approximately the best case. Splitting, or rather creating, strings should be avoided.
  • Maybe I have stated it incorrectly: when I said read byte by byte, we first read 10240 bytes into the byte buffer, and then use the ByteBuffer.get() method to check which byte it is.
  • As you clearly can't describe it accurately, you should certainly post some code.
  • if (!inputByteBuffer.hasRemaining()){ inputByteBuffer.clear(); noOfBytes = inputFileChannel.read(inputByteBuffer); inputByteBuffer.flip(); }

4 Answers


You're causing a lot of overhead with the constant hasRemaining()/read() checks as well as the constant get() calls. It would probably be better to get() the entire buffer into an array and process that directly, only calling read() when you get to the end.

And to answer a question in the comments: you should not allocate a new ByteBuffer per read. This is expensive; keep using the same one. And NB do not use a DirectByteBuffer for this application. It is not appropriate: it's only appropriate when you want the data to stay south of the JVM/JNI boundary, e.g. when merely copying between channels.
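
A minimal sketch of that shape: one ByteBuffer reused across reads, drained with a single bulk get() per fill (the class name, method, and buffer size here are illustrative, not from the question's code):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class BulkGetSketch {
    // Counts newlines as a stand-in for the real field parsing.
    public static long countLines(String path) throws IOException {
        long lines = 0;
        try (FileChannel channel = FileChannel.open(Paths.get(path), StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocate(64 * 1024); // one buffer, reused for every read
            byte[] chunk = new byte[buffer.capacity()];
            int n;
            while ((n = channel.read(buffer)) != -1) {
                buffer.flip();
                buffer.get(chunk, 0, n); // one bulk copy instead of n single get() calls
                for (int i = 0; i < n; i++) {
                    if (chunk[i] == 10) { // newline; delimiter handling would go here too
                        lines++;
                    }
                }
                buffer.clear(); // make the same buffer ready for the next fill
            }
        }
        return lines;
    }
}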

But I think I would throw this away, or rather rewrite it, using BufferedReader.read(), rather than readLine() followed by string splits, and using much the same logic as you have here, except of course that you don't need to keep calling hasRemaining() and filling the buffer, which BufferedReader will do automatically for you.

You have to take care to store the result of read() into an int, and to check it for -1 after every read().

It isn't clear to me that you should be using a Reader at all actually, unless you know you have multibyte text. Possibly a simple BufferedInputStream would be more appropriate.
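
For instance, a minimal sketch of the read()-based loop (the class and method names are illustrative; BufferedReader refills its internal buffer for you):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ReaderReadSketch {
    // Counts delimiter characters as a stand-in for the real parsing.
    public static long countDelimiters(String path) throws IOException {
        long count = 0;
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8))) {
            int c; // must be an int: read() returns -1 at end of stream,
                   // which a char variable could not represent
            while ((c = in.read()) != -1) {
                if (c == '^') { // illustrative; real code would track the full ^|^ sequence
                    count++;
                }
            }
        }
        return count;
    }
}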


4 Comments

Thanks for your detailed explanation. Our file is UTF-8 encoded, and there are special characters where a single character can take two bytes, and we have to convert to String to do some business logic.
I will try the above suggestions and post any improvements/findings here.
I tried copying the ByteBuffer contents into a byte[] and then iterating through the byte[], but it looks like it's taking even more time than before: if (byteArrayIndex == 0 || byteArrayIndex == byteArray.length){ inputByteBuffer.clear(); byteArrayIndex = 0; int noOfBytes = inputFileChannel.read(inputByteBuffer); inputByteBuffer.flip(); byteArray = new byte[noOfBytes]; inputByteBuffer.get(byteArray,0,noOfBytes); }
One reason could be the original author of the code did not profile the code. :)

While one cannot tell with certainty how a particular piece of code will behave, the best way is to profile it, just as you did. FileChannel, while perceived to be faster, is actually not helping in your case. But this may not be because of reading from the file, but because of the actual processing you do with the content you read. One article I would like to point out when dealing with files is https://www.redgreencode.com/why-is-java-io-slow/

Also see the corresponding GitHub codebase: Java IO benchmark

I would like to point out this code, which uses a combination of both worlds:

fos = new FileOutputStream(outputFile);
outFileChannel = fos.getChannel();
bufferedWriter = new BufferedWriter(Channels.newWriter(outFileChannel, "UTF-8"));

Since it is a read in your case, I would consider:

File inputFile = new File("C:\\input.txt");
FileInputStream fis = new FileInputStream(inputFile);
FileChannel inputChannel = fis.getChannel();
BufferedReader bufferedReader = new BufferedReader(Channels.newReader(inputChannel,"UTF-8"));

Also, I would tweak the chunk size; with Spring Batch it is always trial and error to find the sweet spot.

On a completely unrelated note, the reason for your problem of not being able to use BufferedReader is the doubling of characters, and I am assuming this happens more commonly with EBCDIC characters. I would simply run a loop like this to identify the troublemakers and eliminate them at the source:

import java.io.UnsupportedEncodingException;

public class EbcdicConvertor {

    public static void main(String[] args) throws UnsupportedEncodingException {
        int index = 0;
        for (int i = -127; i < 128; i++) {
            byte[] b = new byte[1];
            b[0] = (byte) i;
            String cp037 = new String(b, "CP037");
            if (cp037.getBytes().length == 2) {
                index++;
                System.out.println(i + "::" + cp037);
            }
        }
        System.out.println(index);
    }
}

The above answer was written without testing my actual hypothesis. Here is an actual program to measure time. The results speak for themselves on a 200 MB file:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileReader;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

public class ReadComplexDelimitedFile {
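    // Note: the unqualified BufferedReader used below appears to be the modified
    // copy of java.io.BufferedReader shown further down (in the same package);
    // readFileUsingBufferedReader references the stock java.io.BufferedReader explicitly.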
    private static long total = 0;
    private static final Pattern DELIMITER_PATTERN = Pattern.compile("\\^\\|\\^");

    private void readFileUsingScanner() {

        String s;
        try (Scanner stdin = new Scanner(new File(this.getClass().getResource("input.txt").getPath()))) {
            while (stdin.hasNextLine()) {
                s = stdin.nextLine();
                String[] fields = DELIMITER_PATTERN.split(s, 0);
                total = total + fields.length;
            }
        } catch (Exception e) {
            System.err.println("Error");
        }

    }

    private void readFileUsingCustomBufferedReader() {

        try (BufferedReader stdin = new BufferedReader(new FileReader(new File(this.getClass().getResource("input.txt").getPath())))) {
            String s;
            while ((s = stdin.readLine()) != null) {
                String[] fields = DELIMITER_PATTERN.split(s, 0);
                total += fields.length;
            }
        } catch (Exception e) {
            System.err.println("Error");
        }

    }

    private void readFileUsingBufferedReader() {

        try (java.io.BufferedReader stdin = new java.io.BufferedReader(new FileReader(new File(this.getClass().getResource("input.txt").getPath())))) {
            String s;
            while ((s = stdin.readLine()) != null) {
                String[] fields = DELIMITER_PATTERN.split(s, 0);
                total += fields.length;
            }
        } catch (Exception e) {
            System.err.println("Error");
        }

    }


    private void readFileUsingBufferedReaderFileChannel() {
        try (FileInputStream fis = new FileInputStream(this.getClass().getResource("input.txt").getPath())) {
            try (FileChannel inputChannel = fis.getChannel()) {
                try (BufferedReader stdin = new BufferedReader(Channels.newReader(inputChannel, "UTF-8"))) {
                    String s;
                    while ((s = stdin.readLine()) != null) {
                        String[] fields = DELIMITER_PATTERN.split(s, 0);
                        total = total + fields.length;
                    }
                }
            } catch (Exception e) {
                System.err.println("Error");
            }
        } catch (Exception e) {
            System.err.println("Error");
        }

    }

    private void readFileUsingBufferedReaderByteFileChannel() {
        try (FileInputStream fis = new FileInputStream(this.getClass().getResource("input.txt").getPath())) {
            try (FileChannel inputChannel = fis.getChannel()) {
                try (BufferedReader stdin = new BufferedReader(Channels.newReader(inputChannel, "UTF-8"))) {
                    int b;
                    StringBuilder sb = new StringBuilder();
                    while ((b = stdin.read()) != -1) {
                        if (b == 10) {

                            total = total + DELIMITER_PATTERN.split(sb, 0).length;
                            sb = new StringBuilder();
                        } else {
                            sb.append((char) b);
                        }
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        } catch (Exception e) {
            System.err.println("Error");
        }

    }

    private void readFileUsingFileChannelStream() {

        try (RandomAccessFile fis = new RandomAccessFile(new File(this.getClass().getResource("input.txt").getPath()), "r")) {
            try (FileChannel inputChannel = fis.getChannel()) {
                ByteBuffer byteBuffer = ByteBuffer.allocate(8192);
                ByteBuffer recordBuffer = ByteBuffer.allocate(250);
                int recordLength = 0;
                while ((inputChannel.read(byteBuffer)) != -1) {
                    byte b;
                    byteBuffer.flip();
                    while (byteBuffer.hasRemaining() && (b = byteBuffer.get()) != -1) {
                        if (b == 10) {
                            recordBuffer.flip();
                            total = total + splitIntoFields(recordBuffer, recordLength);
                            recordBuffer.clear();
                            recordLength = 0;
                        } else {
                            ++recordLength;
                            recordBuffer.put(b);
                        }
                    }
                    byteBuffer.clear();
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }

    }

    private int splitIntoFields(ByteBuffer recordBuffer, int recordLength) {
        byte b;
        String[] fields = new String[17];
        int fieldCount = -1;
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < recordLength - 1; i++) {
            b = recordBuffer.get(i);
            if (b == 94 && recordBuffer.get(++i) == 124 && recordBuffer.get(++i) == 94) {
                fields[++fieldCount] = sb.toString();
                sb = new StringBuilder();
            } else {
                sb.append((char) b);
            }
        }
        fields[++fieldCount] = sb.toString();
        return fields.length;

    }


    public static void main(String args[]) {
        // JVM warmup
        for (int i = 0; i < 100000; i++) {
            total += i;
        }
        // We know Scanner is slow - still warming up
        ReadComplexDelimitedFile readComplexDelimitedFile = new ReadComplexDelimitedFile();
        List<Long> longList = new ArrayList<>(50);
        for (int i = 0; i < 50; i++) {
            total = 0;
            long startTime = System.nanoTime();
            readComplexDelimitedFile.readFileUsingScanner();
            long stopTime = System.nanoTime();
            long timeDifference = stopTime - startTime;
            longList.add(timeDifference);

        }
        System.out.println("Time taken for readFileUsingScanner");
        longList.forEach(System.out::println);
        // Actual performance test starts here

        longList = new ArrayList<>(10);
        for (int i = 0; i < 10; i++) {
            total = 0;
            long startTime = System.nanoTime();
            readComplexDelimitedFile.readFileUsingBufferedReaderFileChannel();
            long stopTime = System.nanoTime();
            long timeDifference = stopTime - startTime;
            longList.add(timeDifference);

        }
        System.out.println("Time taken for readFileUsingBufferedReaderFileChannel");
        longList.forEach(System.out::println);
        longList.clear();
        for (int i = 0; i < 10; i++) {
            total = 0;
            long startTime = System.nanoTime();
            readComplexDelimitedFile.readFileUsingBufferedReader();
            long stopTime = System.nanoTime();
            long timeDifference = stopTime - startTime;
            longList.add(timeDifference);

        }
        System.out.println("Time taken for readFileUsingBufferedReader");
        longList.forEach(System.out::println);
        longList.clear();
        for (int i = 0; i < 10; i++) {
            total = 0;
            long startTime = System.nanoTime();
            readComplexDelimitedFile.readFileUsingCustomBufferedReader();
            long stopTime = System.nanoTime();
            long timeDifference = stopTime - startTime;
            longList.add(timeDifference);

        }
        System.out.println("Time taken for readFileUsingCustomBufferedReader");
        longList.forEach(System.out::println);
        longList.clear();
        for (int i = 0; i < 10; i++) {
            total = 0;
            long startTime = System.nanoTime();
            readComplexDelimitedFile.readFileUsingBufferedReaderByteFileChannel();
            long stopTime = System.nanoTime();
            long timeDifference = stopTime - startTime;
            longList.add(timeDifference);

        }
        System.out.println("Time taken for readFileUsingBufferedReaderByteFileChannel");
        longList.forEach(System.out::println);
        longList.clear();
        for (int i = 0; i < 10; i++) {
            total = 0;
            long startTime = System.nanoTime();
            readComplexDelimitedFile.readFileUsingFileChannelStream();
            long stopTime = System.nanoTime();
            long timeDifference = stopTime - startTime;
            longList.add(timeDifference);

        }
        System.out.println("Time taken for readFileUsingFileChannelStream");
        longList.forEach(System.out::println);

    }
}

BufferedReader was written a very long time ago, so we can rewrite some parts relevant to this example. For instance, we don't care about \r, skipLF, skipCR, or those kinds of things, since we are only going to read the file (so no need for synchronized); by extension there is no need for StringBuffer either, and StringBuilder can be used instead. A performance improvement is immediately seen.

A dangerous hack: remove synchronized and replace StringBuffer with StringBuilder. Don't use it without proper testing and without knowing what you are doing:

public String readLine() throws IOException {
        StringBuilder s = null;
        int startChar;


        bufferLoop:
        for (; ; ) {

            if (nextChar >= nChars)
                fill();
            if (nextChar >= nChars) { /* EOF */
                if (s != null && s.length() > 0)
                    return s.toString();
                else
                    return null;
            }
            boolean eol = false;
            char c = 0;
            int i;

            /* skipLF handling from the original JDK source removed */

            charLoop:
            for (i = nextChar; i < nChars; i++) {
                c = cb[i];
                if (c == '\n') {
                    eol = true;
                    break charLoop;
                }
            }

            startChar = nextChar;
            nextChar = i;

            if (eol) {
                String str;
                if (s == null) {
                    str = new String(cb, startChar, i - startChar);
                } else {
                    s.append(cb, startChar, i - startChar);
                    str = s.toString();
                }
                nextChar++;
                return str;
            }

            if (s == null)
                s = new StringBuilder(defaultExpectedLineLength);
            s.append(cb, startChar, i - startChar);
        }
    }

Java 8, Intel i5, 12 GB RAM, Windows 10. Result (times in nanoseconds, 10 runs each):

Time taken for readFileUsingBufferedReaderFileChannel:

  • 2581635057 1849820885 1763992972 1770510738 1746444157 1733491399 1740530125 1723907177 1724280512 1732445638

Time taken for readFileUsingBufferedReader:

  • 1851027073 1775304769 1803507033 1789979554 1786974538 1802675458 1789672780 1798036307 1789847714 1785302003

Time taken for readFileUsingCustomBufferedReader:

  • 1745220476 1721039975 1715383650 1728548462 1724746005 1718177466 1738026017 1748077438 1724608192 1736294175

Time taken for readFileUsingBufferedReaderByteFileChannel:

  • 2872857919 2480237636 2917488143 2913491126 2880117231 2904614745 2911756298 2878777496 2892169722 2888091211

Time taken for readFileUsingFileChannelStream:

  • 3039447073 2896156498 2538389366 2906287280 2887612064 2929288046 2895626578 2955326255 2897535059 2884476915

Process finished with exit code 0

3 Comments

FileChannel does not have a 'non-blocking nature'.
My bad, redacted
Anyway, it is conclusive from my profiling that BufferedReader.readLine() is faster than all the other methods.

I did try NIO with all possible options (those provided in this post, and to the best of my knowledge and research) and found that it came nowhere close to BufferedReader in terms of reading a text file.

Changing BufferedReader to use StringBuilder in place of StringBuffer, I don't see any significant improvement in performance (only a very few seconds for some files, and some of them were better using StringBuffer itself).

Removing the synchronized block also didn't give much of an improvement, and it's not worth tweaking something from which we received no benefit.

The below is the time taken (reading, processing, and writing; the time taken for processing and writing is not significant, not even 20% of the total) for a file of around 50 GB:

  • NIO: 71.67 minutes
  • IO (BufferedReader): 10.84 minutes

Thank you all for taking the time to read and respond to this post and provide suggestions.



The main issue here is creating a new byte[] very rapidly (fieldBytes = new byte[maxFieldSize];).

Since a new array is created for every field, garbage collection kicks in very often, triggering "stop the world" pauses to reclaim the memory.

Also, the object creation itself can be expensive.

We could instead initialize the byte array once and then track indexes, converting each field to a String using just the start and end index, as in the sketch below.
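
A minimal sketch of that reuse (the class and method names are hypothetical; the 350-byte maximum comes from the question):

import java.nio.charset.StandardCharsets;

public class ReusedFieldBuffer {
    private static final int MAX_FIELD_SIZE = 350; // figure quoted in the question
    private final byte[] fieldBytes = new byte[MAX_FIELD_SIZE];
    private int length = 0;

    // Append one byte of the current field (bounds handling omitted for brevity).
    void append(byte b) {
        fieldBytes[length++] = b;
    }

    // Convert only the bytes written so far, then reset the index so the
    // same array is reused for the next field: no new byte[] per field.
    String endField() {
        String field = new String(fieldBytes, 0, length, StandardCharsets.UTF_8);
        length = 0;
        return field;
    }
}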

And anyway, BufferedReader is faster than FileChannel, at least for reading ASCII files, so to keep the code simple we continued using BufferedReader itself.

Using BufferedReader, the development and testing effort is reduced by not having tedious logic to find delimiters and populate the object.

