
Let's consider a scenario with two files:

  1. Accounts.csv
  2. Transaction.csv

We have a mapping from each account number to transaction details, so one account number can have multiple transactions. Using these details we have to generate a PDF for each account.

Suppose the transaction CSV file is very large (>1 GB); then loading and parsing all the details could cause a memory issue. So what would be the best approach to parse the transaction file? Reading it chunk by chunk also leads to memory consumption. Please advise.

  • I would be loading them into a database and then executing queries. Commented Mar 18, 2019 at 7:30
  • 1 GB would not be considered "very large" IMO. With a reasonably big heap this would not be a problem at all (particularly if you read it chunk by chunk). Loading it into a DB would be an enormous waste of time and resources. Commented Mar 18, 2019 at 7:54

3 Answers


As others have said, a database would be a good solution.

Alternatively, you could sort the two files on the account number. Most operating systems provide efficient file-sorting programs, e.g. on Linux (sorting on the 5th column):

LC_ALL=C sort -t, -k5 file.csv > sorted.csv

taken from Sorting csv file by 5th column using bash

You can then read the two files sequentially.

Your programming logic is:

if (Accounts.accountNumber < Transaction.accountNumber) {
    read Accounts file        // account with no (more) transactions
} else if (Accounts.accountNumber == Transaction.accountNumber) {
    process transaction
    read Transaction file
} else {
    read Transaction file     // transaction with no matching account
}

The memory requirements will be tiny: you only need to hold one record from each file in memory.
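The merge logic above can be sketched in Java. This is a minimal, self-contained version that assumes the account number is the first CSV column (in the real files it may be a different column, hence the `-k5` in the sort command) and uses in-memory readers to stand in for the two sorted files; the class and method names are illustrative, not from the question:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class MergeJoin {

    // Joins two CSV sources that are BOTH sorted by account number
    // (assumed to be the first column). Only one record from each
    // source is held in memory at a time. Returns "account:amount"
    // pairs; in the real program you would emit them into the PDF
    // for that account instead.
    static List<String> join(BufferedReader accounts, BufferedReader transactions)
            throws IOException {
        List<String> result = new ArrayList<>();
        String acc = accounts.readLine();
        String txn = transactions.readLine();
        while (acc != null && txn != null) {
            String accKey = acc.split(",")[0];
            String txnKey = txn.split(",")[0];
            int cmp = accKey.compareTo(txnKey);
            if (cmp < 0) {
                acc = accounts.readLine();      // account with no more transactions
            } else if (cmp == 0) {
                result.add(accKey + ":" + txn.split(",")[1]);
                txn = transactions.readLine();  // next transaction, same account
            } else {
                txn = transactions.readLine();  // transaction with no matching account
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // StringReaders simulate the two pre-sorted CSV files.
        BufferedReader accounts = new BufferedReader(
                new StringReader("A1,Alice\nA2,Bob\nA3,Carol\n"));
        BufferedReader transactions = new BufferedReader(
                new StringReader("A1,100\nA1,250\nA3,75\n"));
        System.out.println(join(accounts, transactions));
        // [A1:100, A1:250, A3:75]
    }
}
```

Swap the `StringReader`s for `FileReader`s over the sorted files and the streaming behaviour is unchanged: memory use stays constant no matter how large Transactions.csv grows.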




Let's say you are using Oracle as the database. You could load the data into the corresponding tables using Oracle's SQL*Loader tool.

Once the data is loaded you could use simple SQL Queries to Join and Query data from the loaded tables.

This approach will work with any database, but you will need to find the appropriate bulk-loading tool for it.
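As a rough sketch of that workflow: a SQL*Loader control file to bulk-load the transactions, followed by the join query. The table and column names here (`accounts`, `transactions`, `account_number`, etc.) are assumptions, since the question does not give a schema:

```sql
-- transactions.ctl (run with: sqlldr user/pass control=transactions.ctl)
LOAD DATA
INFILE 'Transactions.csv'
INTO TABLE transactions
FIELDS TERMINATED BY ','
(account_number, txn_date, amount)

-- After loading both files, fetch each account's transactions in order;
-- the PDF generator can consume this result set one account at a time.
SELECT a.account_number, t.txn_date, t.amount
FROM   accounts a
JOIN   transactions t ON t.account_number = a.account_number
ORDER  BY a.account_number;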



Of course, importing the data into a database first would be the most elegant way. But your question leaves the impression that this isn't an option.

So I recommend you read transactions.csv line by line (for instance using a BufferedReader). Because in CSV format each line is a record, you can filter while reading and discard every record that is not for your current account. After one traversal of the file you have all transactions for one account, and that should usually fit into memory. A drawback of this approach is that you end up reading the transactions multiple times, once for each account's PDF generation. But if your application needed to be highly optimized, I suspect you would already have used a database.
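A minimal sketch of this single-account filter, assuming the account number is the first CSV column (the file may differ; adjust the column index accordingly). The method reads the whole source once and keeps only the matching lines in memory:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class TransactionFilter {

    // One pass over the transactions CSV, keeping only the lines whose
    // first column equals the given account number. Memory use is
    // bounded by the transactions of a single account, not the file size.
    static List<String> transactionsFor(Reader csv, String accountNumber)
            throws IOException {
        List<String> matches = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(csv)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith(accountNumber + ",")) {
                    matches.add(line);
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for new FileReader("transactions.csv").
        String csv = "A1,100\nA2,40\nA1,250\n";
        System.out.println(transactionsFor(new StringReader(csv), "A1"));
        // [A1,100, A1,250]
    }
}
```

In the real program you would call `transactionsFor(new FileReader("transactions.csv"), account)` once per account before generating that account's PDF, which is exactly the repeated-traversal cost the answer mentions.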

