
I would like to parse a Google eBook transaction report. I opened it in Notepad++ to see precisely the field and record delimiters. It is a tab-delimited file; every header and data field is surrounded by quotation marks. The first two lines of the CSV file are:

"Transaction Date" "Id"    "Product"   "Type"  "Preorder"  "Qty"   "Primary ISBN"  "Imprint Name"  "Title" "Author"    "Original List Price Currency"  "Original List Price"   "List Price Currency"   "List Price [tax inclusive]"    "List Price [tax exclusive]"    "Country of Sale"   "Publisher Revenue %"   "Publisher Revenue" "Payment Currency"  "Payment Amount"    "Currency Conversion Rate"
"2016. 09. 01." "ID:1166315449551685"   "Single Purchase"   "Sale"  "None"  "1" "9789633780664" "Book and Walk Kft" "Bánk bán"  "József Katona" "HUF"   "0,00"  "HUF"   "0,00"  "0,00"  "HU"    "52,0%" "0,00"  ""  ""  ""

I use the following code to parse the CSV file:

private List<Sales> parseCsv(File csv) {
    Calendar max = Calendar.getInstance();
    Calendar current = Calendar.getInstance();
    boolean firstRound = true;

    List<Sales> sales = new ArrayList<>();
    Sales currentRecord;
    Reader in;
    try {
        in = new FileReader(csv);
        Iterable<CSVRecord> records;

        try {

            records = CSVFormat.TDF.withQuote('\"').withFirstRecordAsHeader().parse(in);
            for (CSVRecord record : records) {
                currentRecord = new Sales();
                currentRecord.setAuthor(record.get("Author"));
                currentRecord.setTitle(record.get("Title"));
                currentRecord.setPublisher(record.get("Imprint Name"));
                currentRecord.setIsbn(record.get("Primary ISBN"));
                currentRecord.setChannel("Google");
                currentRecord.setBookId(record.get("Id"));
                currentRecord.setCountry(record.get("Country of Sale"));
                currentRecord.setUnits(Integer.parseInt(record.get("Qty")));
                currentRecord.setUnitPrice(Float.parseFloat(record.get("List Price [tax exclusive]")));

                Date transDate;
                try {
                    // sourceDateFormat is a class field, presumably a SimpleDateFormat like new SimpleDateFormat("yyyy. MM. dd.")
                    transDate = sourceDateFormat.parse(record.get("Transaction Date"));
                    if (firstRound) {
                        max.setTime(transDate);
                    }
                    current.setTime(transDate);
                    if (current.after(max)) {
                        max.setTime(current.getTime());
                    }
                    currentRecord.setDatum(transDate);
                } catch (ParseException e) {
                    LOG.log(Level.SEVERE, "The date in file {0} has an invalid format", csv.getAbsolutePath());
                }

                currentRecord.setCurrencyCustomer(record.get("List Price Currency"));
                currentRecord.setCurrencyProceeds(record.get("Payment Amount"));
                currentRecord.setCurrencyProceeds(record.get("Payment Currency"));
                sales.add(currentRecord);
            }
            LOG.log(Level.INFO, "Daily sales transactions were successfully parsed from {0}",
                    csv.getAbsolutePath());
            return sales;
        } catch (IOException e1) {
            LOG.log(Level.SEVERE, "Something is wrong with the structure of file {0}", csv.getAbsolutePath());
        }
    } catch (FileNotFoundException e1) {
        LOG.log(Level.SEVERE, "Cannot find file {0}.", csv.getAbsolutePath());
    }
    return null;
}

When I debug the parsing process I can see that record.get("Author") throws a runtime exception:

java.lang.IllegalArgumentException: Mapping for Author not found, expected one of [��"

Obviously I have a column named Author. Any idea what goes wrong?

  • try to supply a Charset to parse: .parse(in, StandardCharsets.UTF_8). Commented Oct 20, 2016 at 8:36
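One concrete way to follow that suggestion with commons-csv 1.4 is the CSVParser.parse(File, Charset, CSVFormat) overload, so the file is opened with an explicit encoding instead of FileReader's platform default. A minimal sketch (the path "report.tsv" and the UTF-8 assumption are placeholders, not from the original post):

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class CharsetParseSketch {
    public static void main(String[] args) throws IOException {
        // Open the file with an explicit charset rather than the platform default
        CSVFormat format = CSVFormat.TDF.withQuote('"').withFirstRecordAsHeader();
        try (CSVParser parser = CSVParser.parse(new File("report.tsv"), StandardCharsets.UTF_8, format)) {
            for (CSVRecord record : parser) {
                System.out.println(record.get("Author"));
            }
        }
    }
}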

2 Answers

1

When I convert this to a unit test and run it with the current commons-csv version 1.4, it works fine for me; therefore:

  • Check with the latest version of commons-csv
  • Make sure that there are actually TABs in the file, not blanks around the author-entries for some reason
  • Specify the actual encoding of the file when calling parse() to correctly handle non-ASCII characters (thanks to comments from @tonakai)

The following unit-test works fine with commons-csv 1.4

private final static String DATA = "\"Transaction Date\"\t\"Id\"\t\"Product\"\t\"Type\"\t\"Preorder\"\t\"Qty\"\t\"Primary ISBN\"\t\"Imprint Name\"\t\"Title\"\t\"Author\"\t\"Original List Price Currency\"\t\"Original List Price\"\t\"List Price Currency\"\t\"List Price [tax inclusive]\"\t\"List Price [tax exclusive]\"\t\"Country of Sale\"\t\"Publisher Revenue %\"\t\"Publisher Revenue\"\t\"Payment Currency\"\t\"Payment Amount\"\t\"Currency Conversion Rate\"\n" +
        "\"2016. 09. 01.\"\t\"ID:1166315449551685\"\t\"Single Purchase\"\t\"Sale\"\t\"None\"\t\"1\"\t\"9789633780664\"\t\"Book and Walk Kft\"\t\"Bánk bán\"\t\"József Katona\"\t\"HUF\"\t\"0,00\"\t\"HUF\"\t\"0,00\"\t\"0,00\"\t\"HU\"\t\"52,0%\"\t\"0,00\"\t\"\"\t\"\"\t\"\"";

@Test
public void parseCsv() throws IOException {
    final CSVFormat format = CSVFormat.TDF.withQuote('\"').withFirstRecordAsHeader();
    Iterable<CSVRecord> records = format.parse(new StringReader(DATA));

    System.out.println("Headers: " + Arrays.toString(format.getHeader()));

    for (CSVRecord record : records) {
        assertNotNull(record.get("Author"));
        assertNotNull(record.get("Title"));
        assertNotNull(record.get("Imprint Name"));
        assertNotNull(record.get("Primary ISBN"));
        assertNotNull(record.get("Id"));
        assertNotNull(record.get("Country of Sale"));
        assertNotNull(record.get("Qty"));
        assertNotNull(record.get("List Price [tax exclusive]"));

        assertNotNull(record.get("Transaction Date"));

        assertNotNull(record.get("List Price Currency"));
        assertNotNull(record.get("Payment Amount"));
        assertNotNull(record.get("Payment Currency"));

        System.out.println("Record: " + record.toString());
    }
}
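The original answer does not list the test's imports; it presumably relies on something like the following (JUnit 4 plus commons-csv):

import static org.junit.Assert.assertNotNull;

import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;
import org.junit.Test;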

4 Comments

I checked and I use version 1.4. What is strange is that when I look at my Maven dependencies in my Eclipse IDE and drill down into commons-csv-1.4.jar, I see the parse method with three different method signatures, but the autocomplete of my IDE offers only the basic one.
Might be because you are passing a Reader in; the other signatures don't use a Reader. If you want to use a Reader, then try to construct it without FileReader, as mentioned in docs.oracle.com/javase/7/docs/api/java/io/FileReader.html, using an InputStreamReader on a FileInputStream.
I use version 1.4, and I also checked that all the fields are delimited by tabs; I verified it with Excel and Notepad++. I also implemented @tonakai's comment to handle non-ASCII characters correctly: 'CSVFormat csvFormat = CSVFormat.TDF.withQuote('\"').withFirstRecordAsHeader(); records = CSVParser.parse(csv, StandardCharsets.UTF_8, csvFormat);'. I still get the same error as before.
Then you will need to narrow down the problem: first run the unit test that I posted and see if that works, then reduce your non-working code so that it becomes more and more like the unit test; this way you will see at which step it breaks.
0

It turned out that the encoding was the source of the problem. Based on @tonakai's comment I started to analyze the encoding of the Google CSV report. It was UTF-16 Little Endian. As my file included a byte-order mark, I had to use BOMInputStream (from Apache Commons IO) and refactor my code a little bit.

Reader r = newReader(csv);
CSVParser csvParser = CSVFormat.TDF.withFirstRecordAsHeader().withQuoteMode(QuoteMode.ALL).parse(r);

.....

private InputStreamReader newReader(final File csv) throws FileNotFoundException {
    return new InputStreamReader(new BOMInputStream(new FileInputStream(csv), ByteOrderMark.UTF_16LE),
            StandardCharsets.UTF_16LE);
}

It is working now.
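If future reports are not guaranteed to be UTF-16LE, a slightly more general variant (a sketch, not part of the original answer; the class and method names here are made up) lets BOMInputStream detect whichever byte-order mark is present and derives the charset from it, falling back to UTF-8 when there is none:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

import org.apache.commons.io.ByteOrderMark;
import org.apache.commons.io.input.BOMInputStream;

public class BomAwareReaderFactory {

    // Detect which BOM (if any) the file starts with and open a Reader with the
    // matching charset; BOMInputStream skips the BOM bytes themselves by default.
    public static Reader newBomAwareReader(final File csv) throws IOException {
        BOMInputStream bomIn = new BOMInputStream(new FileInputStream(csv),
                ByteOrderMark.UTF_8, ByteOrderMark.UTF_16LE, ByteOrderMark.UTF_16BE);
        String charsetName = bomIn.hasBOM() ? bomIn.getBOMCharsetName() : StandardCharsets.UTF_8.name();
        return new InputStreamReader(bomIn, Charset.forName(charsetName));
    }
}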
