
Let's consider a scenario with two files:

  1. Accounts.csv
  2. Transaction.csv

We have a mapping from each account number to transaction details, so one account number can have multiple transactions. Using these details we have to generate a PDF for each account.

Suppose the transaction CSV file is very large (>1 GB); then loading and parsing all the details could cause a memory issue. So what would be the best approach to parse the transaction file? Reading it chunk by chunk also leads to memory consumption. Please advise.

  • I would be loading them into a database and then executing queries. Commented Mar 18, 2019 at 7:30
  • 1 GB would not be considered "very large" IMO. With a reasonably big heap this would not be a problem at all (particularly if you read it chunk by chunk). Loading it into a DB would be an enormous waste of time and resources. Commented Mar 18, 2019 at 7:54

3 Answers


As others have said, a database would be a good solution.

Alternatively, you could sort the two files on the account number. Most operating systems provide efficient file-sorting programs, e.g. on Linux (sorting on the 5th column):

LC_ALL=C sort -t, -k5 file.csv > sorted.csv

taken from Sorting csv file by 5th column using bash

You can then read the two files sequentially.

Your programming logic is:

if (Accounts.accountNumber < Transaction.accountNumber) {
    read Accounts file        // account with no (more) transactions
} else if (Accounts.accountNumber == Transaction.accountNumber) {
    process transaction
    read Transaction file
} else {
    read Transaction file     // transaction with no matching account
}

The memory requirements will be tiny: you only need to hold one record from each file in memory.
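The merge logic above can be sketched in Java. This is a minimal, self-contained version that assumes the account number is the first CSV column (in the real files it may be a different column, hence the `-k5` in the sort command) and uses in-memory readers to stand in for the two sorted files; the class and method names are illustrative, not from the question:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class MergeJoin {

    // Joins two CSV sources that are BOTH sorted by account number
    // (assumed to be the first column). Only one record from each
    // source is held in memory at a time. Returns "account:amount"
    // pairs; in the real program you would emit them into the PDF
    // for that account instead.
    static List<String> join(BufferedReader accounts, BufferedReader transactions)
            throws IOException {
        List<String> result = new ArrayList<>();
        String acc = accounts.readLine();
        String txn = transactions.readLine();
        while (acc != null && txn != null) {
            String accKey = acc.split(",")[0];
            String txnKey = txn.split(",")[0];
            int cmp = accKey.compareTo(txnKey);
            if (cmp < 0) {
                acc = accounts.readLine();      // account with no more transactions
            } else if (cmp == 0) {
                result.add(accKey + ":" + txn.split(",")[1]);
                txn = transactions.readLine();  // next transaction, same account
            } else {
                txn = transactions.readLine();  // transaction with no matching account
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // StringReaders simulate the two pre-sorted CSV files.
        BufferedReader accounts = new BufferedReader(
                new StringReader("A1,Alice\nA2,Bob\nA3,Carol\n"));
        BufferedReader transactions = new BufferedReader(
                new StringReader("A1,100\nA1,250\nA3,75\n"));
        System.out.println(join(accounts, transactions));
        // [A1:100, A1:250, A3:75]
    }
}
```

Swap the `StringReader`s for `FileReader`s over the sorted files and the streaming behaviour is unchanged: memory use stays constant no matter how large Transactions.csv grows.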




Let's say you are using Oracle as the database. You could load the data into the corresponding tables using Oracle's SQL*Loader tool.

Once the data is loaded you could use simple SQL Queries to Join and Query data from the loaded tables.

This approach will work with any database, but you will need to find the appropriate bulk-loading tool for it.
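As a rough sketch of that workflow: a SQL*Loader control file to bulk-load the transactions, followed by the join query. The table and column names here (`accounts`, `transactions`, `account_number`, etc.) are assumptions, since the question does not give a schema:

```sql
-- transactions.ctl (run with: sqlldr user/pass control=transactions.ctl)
LOAD DATA
INFILE 'Transactions.csv'
INTO TABLE transactions
FIELDS TERMINATED BY ','
(account_number, txn_date, amount)

-- After loading both files, fetch each account's transactions in order;
-- the PDF generator can consume this result set one account at a time.
SELECT a.account_number, t.txn_date, t.amount
FROM   accounts a
JOIN   transactions t ON t.account_number = a.account_number
ORDER  BY a.account_number;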



Of course, importing the data into a database first would be the most elegant way. But your question leaves the impression that this isn't an option.

So I recommend you read transactions.csv line by line (for instance using a BufferedReader). Because in CSV format each line is a record, you can filter while reading and discard every record that is not for your current account. After one traversal of the file you have all transactions for one account, and that should usually fit into memory. A drawback of this approach is that you end up reading the transactions multiple times, once for each account's PDF generation. But if your application needed to be highly optimized, I suspect you would already have used a database.
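A minimal sketch of this single-account filter, assuming the account number is the first CSV column (the file may differ; adjust the column index accordingly). The method reads the whole source once and keeps only the matching lines in memory:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class TransactionFilter {

    // One pass over the transactions CSV, keeping only the lines whose
    // first column equals the given account number. Memory use is
    // bounded by the transactions of a single account, not the file size.
    static List<String> transactionsFor(Reader csv, String accountNumber)
            throws IOException {
        List<String> matches = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(csv)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith(accountNumber + ",")) {
                    matches.add(line);
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for new FileReader("transactions.csv").
        String csv = "A1,100\nA2,40\nA1,250\n";
        System.out.println(transactionsFor(new StringReader(csv), "A1"));
        // [A1,100, A1,250]
    }
}
```

In the real program you would call `transactionsFor(new FileReader("transactions.csv"), account)` once per account before generating that account's PDF, which is exactly the repeated-traversal cost the answer mentions.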

