
Lazy Filtering CSV Files

I needed to filter through millions of log records stored across numerous CSV files. The size of the records greatly exceeded my available memory, so I wanted to go with a lazy approach.

Java 8 Streams API

With JDK 8 we have the Streams API, which, paired with Apache commons-csv, allows us to accomplish this easily.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.stream.StreamSupport;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class LazyFilterer {

    private static CSVParser getParser(String fileName) throws IOException {
        return CSVFormat
                .DEFAULT
                .withFirstRecordAsHeader()
                .parse(new BufferedReader(new FileReader(fileName)));
    }

    public static void main(String[] args) throws Exception {
        File dir = new File("csv");

        for (File file : dir.listFiles()) {
            // try-with-resources closes the underlying reader for each file
            try (CSVParser parser = getParser(file.getAbsolutePath())) {
                StreamSupport.stream(parser.spliterator(), true) // true = parallel
                        .filter(c -> c.get("API_Call").equals("Updates"))
                        .filter(c -> c.get("Remove").isEmpty())
                        .forEach(System.out::println);
            }
        }
    }
}
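For comparison, the same lazy pattern can be built with only the standard library via `Files.lines`, which also reads rows on demand. This is a minimal sketch, not a replacement for a real CSV parser: the sample data, column names, and the naive comma split (which ignores quoted fields) are assumptions for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class StdlibLazyFilter {

    public static void main(String[] args) throws IOException {
        // Tiny sample file standing in for one of the real CSVs (hypothetical columns).
        Path csv = Files.createTempFile("sample", ".csv");
        Files.write(csv, Arrays.asList(
                "API_Call,Remove,Payload",
                "Updates,,a",
                "Updates,x,b",
                "Reads,,c"));

        List<String> matches;
        // Files.lines is lazy: rows are read on demand, so memory stays flat.
        try (Stream<String> lines = Files.lines(csv, StandardCharsets.UTF_8)) {
            matches = lines
                    .skip(1)                          // drop the header row
                    .map(line -> line.split(",", -1)) // naive split; no quoted fields
                    .filter(cols -> cols[0].equals("Updates") && cols[1].isEmpty())
                    .map(cols -> String.join(",", cols))
                    .collect(Collectors.toList());
        }
        System.out.println(matches); // prints [Updates,,a]
        Files.delete(csv);
    }
}
```

The trade-off is that commons-csv handles quoting, escaping, and headers for you, while the stdlib version keeps the dependency footprint at zero.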

Performance

This graph from VisualVM shows the memory usage during the parsing of 2.3 GB of CSV files using a more complex filtration pipeline [1] than shown above.

As you can see, the memory usage basically remains constant [2] as the filtration occurs.

[Figure: VisualVM memory-usage screenshot]

Can you find another method to accomplish the same task more quickly while not increasing code complexity?

Any language is welcome; Java is not necessarily preferred!
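For what it's worth, one stdlib-only direction I have considered is parallelizing across files rather than within each file. This is a rough sketch under stated assumptions: the sample files, column names, and the naive comma split (no quoted fields) are invented for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.stream.Stream;

public class ParallelFiles {

    // Count matching records in one file; reading stays lazy within each file.
    static long countMatches(Path file) {
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return lines.skip(1)                 // drop the header row
                    .map(l -> l.split(",", -1))  // naive split; no quoted fields
                    .filter(c -> c[0].equals("Updates") && c[1].isEmpty())
                    .count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Two tiny sample files standing in for the real "csv" directory.
        Path dir = Files.createTempDirectory("csv");
        Files.write(dir.resolve("a.csv"),
                Arrays.asList("API_Call,Remove", "Updates,", "Reads,"));
        Files.write(dir.resolve("b.csv"),
                Arrays.asList("API_Call,Remove", "Updates,", "Updates,x"));

        long total;
        try (Stream<Path> files = Files.list(dir)) {
            total = files.parallel()             // one task per file
                    .mapToLong(ParallelFiles::countMatches)
                    .sum();
        }
        System.out.println(total); // prints 2
    }
}
```

Since each file is independent, file-level parallelism avoids the ordering and contention issues of a parallel stream over a single parser.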

Footnotes

[1] - E.g. for each CSVRecord that matches on "API_Call" I might need to do some JSON deserialization and do additional filtering after that, or even create an object for certain records to facilitate additional computations.

[2] - The idle time at the beginning of the graph was a System.in.read() used to ensure that VisualVM was fully loaded before computation began.
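To make footnote [1] concrete, here is a rough sketch of that second-stage filtering on deserialized payloads. The hand-rolled `jsonField` helper is a hypothetical stand-in for a real JSON library (e.g. Jackson), and the payload shape and field names are invented for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class PostFilterSketch {

    // Hypothetical stand-in for real JSON deserialization: extracts a
    // top-level string value from a flat JSON object. Illustration only.
    static String jsonField(String json, String key) {
        String marker = "\"" + key + "\":\"";
        int start = json.indexOf(marker);
        if (start < 0) return "";
        start += marker.length();
        int end = json.indexOf('"', start);
        return end < 0 ? "" : json.substring(start, end);
    }

    public static void main(String[] args) {
        // Hypothetical JSON payloads from records that already matched on "API_Call".
        List<String> payloads = Arrays.asList(
                "{\"status\":\"ok\",\"user\":\"alice\"}",
                "{\"status\":\"error\",\"user\":\"bob\"}");

        // Second-stage filter and projection on the deserialized payload.
        List<String> users = payloads.stream()
                .filter(p -> jsonField(p, "status").equals("ok"))
                .map(p -> jsonField(p, "user"))
                .collect(Collectors.toList());

        System.out.println(users); // prints [alice]
    }
}
```

In the real pipeline this step would slot in as additional `.map` and `.filter` stages after the CSV-level filters, keeping the whole thing lazy.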

Comments

  • You are contradicting yourself. A “naive algorithm that reads the entire CSV file to memory” can’t be faster when you say at the same time that the “size of the records greatly exceeded my available memory”. Commented Sep 20, 2016 at 15:25
  • True, good point. You can take that statement with either of the qualifications "if I had enough memory available" or "for a small subset of the data". Commented Sep 20, 2016 at 15:29
  • I don’t see the point of comparing the performance of a hypothetical scenario with a real one. Besides, you didn’t name the “naive implementation”; further, you don’t show any numbers regarding the performance. So your question is based on an empty claim that an unspecified implementation would be faster than what you did in an inapplicable hypothetical scenario. Commented Sep 20, 2016 at 15:39
  • @Holger I deleted the sentence, as it was tangential to the question. If you would like to contribute an answer and need to compare the performance of your solution with the one I gave, you can generate some CSV files and run both on your own machine. Unfortunately, I cannot supply any of the CSV files I am actually filtering. Commented Sep 20, 2016 at 15:46
  • It’s not “tangential to the question”. It’s completely unclear why you think that there must be a faster solution than the one you already have. And questions that merely ask for tools or libraries are off topic on SO. Commented Sep 20, 2016 at 15:53

1 Answer


That's horrible for just 2.3 GB of data; may I suggest trying uniVocity-parsers for better performance? Try this:

CsvParserSettings settings = new CsvParserSettings();
settings.setHeaderExtractionEnabled(true); // grabs headers from input

//select the fields you are interested in. The filtered ones go in front to make things easier
settings.selectFields("API_Call", "Remove"/*, ... and everything else you are interested in*/);

//defines a processor to filter the rows you want
settings.setProcessor(new AbstractRowProcessor() {
    @Override
    public void rowProcessed(String[] row, ParsingContext context) {
        if (row[0].equals("Updates") && row[1].isEmpty()) {
            System.out.println(Arrays.toString(row));
        }
    }
});

// create the parser
CsvParser parser = new CsvParser(settings);

//parses everything. All rows will be sent to the processor defined above
parser.parse(file, "UTF-8"); 

I know it's not functional-style, but it took 20 seconds to process a 4 GB file I created to test this, while consuming less than 75 MB of memory the whole time. From your graph it seems your current approach takes 1 minute for a smaller file and needs 10 times as much memory.

Give this example a try, I believe it will help considerably.

Disclaimer: I'm the author of this library; it's open source and free (Apache 2.0 license).


Comments

  • Awesome, thanks! However, the huge amount of memory and time used in the example was probably due to object instantiation in the stream; I was not doing just the simple filtration I showed. I'll try your library and see how it works.
  • By the way, would you mind sharing the 4 GB file you created? I would like to run both your code and the Java streams code on my machine to compare apples to apples.
  • Glad to help. The file I used is nothing too special. I got maxmind.com/download/worldcities/worldcitiespop.txt.gz and replicated its contents 30 times. I also selected columns "country" and "city" instead. You may just use the original file (without expanding it) and run the same code a few times to get a decent benchmark. From what I tried, the overall performance is the same.
  • Marking this as the answer, as it was the fastest provided solution.
  • For more performance, see also github.com/skjolber/csv-benchmark#results
