Java CSV parser with string separator (multi-character)

Question

Is there any Java open source library that supports multi-character (i.e., String with length > 1) separators (delimiters) for CSV?

By definition, CSV (Comma-Separated Values) data uses a single character (',') as the delimiter. However, many other single-character alternatives exist (e.g., tab), so CSV effectively stands for "Character-Separated Values" data (essentially, DSV: Delimiter-Separated Values data).

The main Java open source libraries for CSV (e.g., OpenCSV) support virtually any character as the delimiter, but not string (multi-character) delimiters. So, for data separated with strings like "|||" there is no other option than preprocessing the input in order to transform the string to a single-character delimiter. From then on, the data can be parsed as single-character separated values.

It would therefore be nice if there were a library that supported string separators natively, so that no preprocessing would be necessary. CSV would then stand for "CharSequence-Separated Values" data. :-)

You could write your own lib. There is not much to it. Read every line from the file and split it with your regex or delimiters.
– juergen d, Commented Dec 28, 2011 at 9:02
Not so straightforward, because CSV can have quoted fields, multiline records, etc. Also, there are countless options for quotes, escape characters, etc. Have a look at secretgeek.net/csv_trouble.asp for a funny overview of the issues you may run into.
– PNS, Commented Dec 28, 2011 at 9:12
That would be a need, indeed, which is why (among many other reasons) a mature library is preferable, but all the ones I have played with seem to support only single-character separators.
– PNS, Commented Dec 28, 2011 at 9:19
@gnat FlatPack seems to support only single-character separators, as well.
– PNS, Commented Dec 28, 2011 at 9:20
@gnat As I say in the question, "So, for data separated with strings like "|||" there is no other option than preprocessing the input in order to transform the string to a single-character delimiter." :-)
– PNS, Commented Dec 28, 2011 at 9:36

Mark O'Connor · Accepted Answer · 2012-01-02 19:32:59Z

5

This is a good question. The problem was not obvious to me until I looked at the javadocs and realised that opencsv only supports a character as a separator, not a string.

Here are a couple of suggested workarounds (the examples are in Groovy but can be converted to Java).

Ignore implicit intermediary fields

Continue to use OpenCSV, but ignore the empty fields. Obviously this is a cheat, but it will work fine for parsing well-behaved data.

    CSVParser csv = new CSVParser((char)'|')

    String[] result = csv.parseLine('J||Project report||"F, G, I"||1')

    assert result[0] == "J"
    assert result[2] == "Project report"
    assert result[4] == "F, G, I"
    assert result[6] == "1"

or

    CSVParser csv = new CSVParser((char)'|')

    String[] result = csv.parseLine('J|||Project report|||"F, G, I"|||1')

    assert result[0] == "J"
    assert result[3] == "Project report"
    assert result[6] == "F, G, I"
    assert result[9] == "1"

Roll your own

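Use Groovy's tokenize method, which is backed by Java's StringTokenizer. Note that it treats every character of its argument as a delimiter and drops empty tokens, so it only behaves like a "|||" split when the fields themselves contain no '|' and are never empty.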

    def result = 'J|||Project report|||"F, G, I"|||1'.tokenize('|||')

    assert result[0] == "J"
    assert result[1] == "Project report"
    assert result[2] == "\"F, G, I\""
    assert result[3] == "1"

The disadvantage of this approach is that you lose the ability to ignore quote characters or escape separators.
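For readers converting the Groovy snippet to Java, here is a minimal sketch of the same "roll your own" idea. It is not from the answer: the class and method names are made up for illustration, and it assumes a split on the literal delimiter is wanted (Groovy's tokenize treats each character of its argument as a separate delimiter). Like the Groovy version, it has no quote or escape handling.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class LiteralSplit {
    // Splits a line on a literal multi-character delimiter.
    // No quote or escape handling: a delimiter inside a quoted field still splits.
    static List<String> split(String line, String delimiter) {
        // Pattern.quote stops '|' from being read as regex alternation;
        // the -1 limit keeps trailing empty fields.
        return Arrays.asList(line.split(Pattern.quote(delimiter), -1));
    }

    public static void main(String[] args) {
        // prints [J, Project report, "F, G, I", 1]
        System.out.println(split("J|||Project report|||\"F, G, I\"|||1", "|||"));
    }
}
```

The quotes around "F, G, I" stay in the value, which is the limitation the answer describes.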

Update

Instead of pre-processing the data and altering its content, why not combine both of the above approaches in a two-step process:

Use the "roll your own" approach to first validate the data. Split each line and prove that it contains the requisite number of fields.
Use the "field ignoring" approach to parse the validated data, secure in the knowledge that the correct number of fields has been specified.

Not very efficient, but possibly easier than writing your own CSV parser :-)

edited Jan 2, 2012 at 19:32

answered Dec 31, 2011 at 17:17

Mark O'Connor


5 Comments

PNS Over a year ago

Mark, the "field ignoring" approach is clever, but it won't work for delimiter strings that consist of more than one distinct character. I have also thought of using the first (or last) character of the string delimiter as the separator and then removing the remaining part of the delimiter, which would appear at the start (or end) of every field. Still, this won't work if that character is a common one, i.e. if it also occurs outside the delimiters. The "roll your own" option is not as easy as it first seems. Check secretgeek.net/csv_trouble.asp for some good reasons why.

Mark O'Connor Over a year ago

I understand the limitations of both solutions. As stated, the "field ignoring" approach is really only good for parsing well-behaved data. As you noted, if someone uses an incorrect number of separator characters, it breaks the assumptions you've made about the data. The "rolling your own" option is really just to prove that it can be done; again, I'd never bother unless the data is incredibly well behaved. In my experience, CSV data rarely is.

PNS Over a year ago

You are right. My experience, too, concurs that CSV data is often not well-formed. +1

Luis Muñiz Over a year ago

FWIW, here's my €0.02: Create a preprocessing Reader that will transform whatever String sequence into a Character, and feed this reader to openCSV.

Mark Teese Over a year ago

Apache Commons CSV doesn't seem to have this function either. According to withRecordSeparator documentation, Parsing currently only works for inputs with '\n', '\r' and "\r\n".
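The preprocessing Reader suggested in the comments above could be sketched roughly as follows. This is an illustration under assumptions, not code from the thread: the class name and the convert helper are invented, the delimiter must be non-empty, the replacement character must not occur anywhere in the data, and (like any plain replacement) it does not distinguish delimiters inside quoted values from those outside. It works as a stream, so the whole file never has to be held in memory.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical Reader that swaps a multi-character delimiter for one character on the fly.
class DelimiterReplacingReader extends FilterReader {
    private final PushbackReader source;
    private final char[] delimiter;
    private final char replacement;

    DelimiterReplacingReader(Reader in, String delimiter, char replacement) {
        // The pushback buffer only ever has to hold a partially matched delimiter.
        super(new PushbackReader(in, delimiter.length()));
        this.source = (PushbackReader) super.in;
        this.delimiter = delimiter.toCharArray();
        this.replacement = replacement;
    }

    @Override
    public int read() throws IOException {
        int first = source.read();
        if (first != delimiter[0]) {
            return first; // ordinary character, or -1 at end of stream
        }
        int matched = 1;
        while (matched < delimiter.length) {
            int next = source.read();
            if (next != delimiter[matched]) {
                // Partial match only: push everything after the first character back.
                if (next != -1) {
                    source.unread(next);
                }
                source.unread(delimiter, 1, matched - 1);
                return first;
            }
            matched++;
        }
        return replacement; // the full delimiter was consumed
    }

    @Override
    public int read(char[] buffer, int offset, int length) throws IOException {
        int count = 0;
        while (count < length) {
            int c = read();
            if (c == -1) {
                break;
            }
            buffer[offset + count++] = (char) c;
        }
        return (count == 0 && length > 0) ? -1 : count;
    }

    // Convenience for in-memory text; real use would wrap a file or network Reader.
    static String convert(String input, String delimiter, char replacement) {
        try (Reader in = new DelimiterReplacingReader(new StringReader(input), delimiter, replacement)) {
            StringBuilder out = new StringBuilder();
            char[] buffer = new char[1024];
            for (int n; (n = in.read(buffer, 0, buffer.length)) != -1; ) {
                out.append(buffer, 0, n);
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(convert("J|||Project report|||1", "|||", '\t'));
    }
}
```

An instance of this reader could then be handed to any parser that accepts a java.io.Reader, configured with the replacement character as its separator. The sketch does not override skip(), so callers that skip characters would bypass the replacement.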

Andrei Filipchyk · Accepted Answer · 2022-10-24 13:41:03Z

2

In 2022, openCSV version 5.7.1 still doesn't support multi-character separators.

Solution: use Apache commons-csv. Version 1.9.0 supports multi-character separators!

    // separator is a String, e.g. "|||"
    CSVFormat format = CSVFormat.Builder.create().setDelimiter(separator).build();

answered Oct 24, 2022 at 13:41

Andrei Filipchyk


2 Comments

Shuai Liu Over a year ago

This should be the simplest solution; it works for me.

Shuai Liu Over a year ago

If the input stream is a gzip stream, commons-csv 1.10.0 will incorrectly parse columns delimited by multiple characters, so use it carefully.

Peter · Accepted Answer · 2018-10-12 19:06:30Z

None of these solutions worked for me, because they all assumed you could store the entire CSV file in memory, allowing for easy replaceAll-type actions.

I know it's slow, but I went with Scanner. It has a surprising number of features, and makes it easy to roll your own simple CSV reader with any string you want as a record delimiter. It also lets you parse very large CSV files (I've done 10GB single files before), since you can read records one at a time.

    // The delimiter is a regex: here, either ">" or a newline
    Scanner s = new Scanner(inputStream, "UTF-8").useDelimiter(">|\n");

I would prefer a faster solution, but no library I've found supports it. FasterXML has had an open ticket to add this functionality since early 2017: https://github.com/FasterXML/jackson-dataformats-text/issues/14
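As a rough sketch of the Scanner approach with a multi-character delimiter such as "|||": the class and method names below are hypothetical, and it assumes the delimiter should be matched literally, which is why it goes through Pattern.quote (useDelimiter otherwise reads '|' as regex alternation). Tokens are handed to a callback one at a time, so the input is never fully loaded. There is no quote handling.

```java
import java.io.InputStream;
import java.util.Scanner;
import java.util.function.Consumer;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class ScannerTokens {
    // Streams tokens split on either the literal field delimiter or a line break.
    static void forEachToken(InputStream in, String fieldDelimiter, Consumer<String> handler) {
        // \R matches any line break (\n, \r\n, ...); the field delimiter is quoted to keep it literal.
        Pattern delimiter = Pattern.compile(Pattern.quote(fieldDelimiter) + "|\\R");
        try (Scanner scanner = new Scanner(in, "UTF-8").useDelimiter(delimiter)) {
            while (scanner.hasNext()) {
                handler.accept(scanner.next());
            }
        }
    }

    public static void main(String[] args) {
        byte[] data = "J|||Project report|||1\nK|||Other|||2".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        forEachToken(new java.io.ByteArrayInputStream(data), "|||", System.out::println);
    }
}
```

Because fields and records share one token stream here, the caller has to know the number of columns to regroup tokens into records; using the line break alone as the Scanner delimiter and splitting each record afterwards is an alternative.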

Niranjan Ravichandran · Accepted Answer · 2022-08-11 08:25:09Z

Workaround for the delimiter "||": add dummy fields between the needed columns, so that the empty field between each pair of '|' characters is bound to a placeholder.

    import com.opencsv.bean.CsvBindByPosition;

    public class ClassName {
        @CsvBindByPosition(position = 0)
        private String column1;
        @CsvBindByPosition(position = 1)
        private String dummy1; // empty field between the two '|' characters
        @CsvBindByPosition(position = 2)
        private String column2;
        @CsvBindByPosition(position = 3)
        private String dummy2;
        @CsvBindByPosition(position = 4)
        private String column3;
        @CsvBindByPosition(position = 5)
        private String dummy3;
        @CsvBindByPosition(position = 6)
        private String column4;
    }

And then parse them using:

    // Needs com.opencsv.bean.CsvToBeanBuilder, java.io.FileReader and java.util.List
    List<ClassName> responses = new CsvToBeanBuilder<ClassName>(new FileReader("test.csv"))
                    .withType(ClassName.class)
                    .withSkipLines(1) // to skip the header
                    .withSeparator('|') // to parse "||", we use a single '|'
                    .build()
                    .parse();

Shuai Liu · Accepted Answer · 2023-11-15 09:49:36Z

1

Try univocity-parsers, which supports multi-character delimiters and has the best performance.

As for commons-csv: If the input stream is a gzip stream, commons-csv 1.10.0 will incorrectly parse columns delimited by multiple characters, so use it carefully.

edited Nov 15, 2023 at 9:49

answered Nov 15, 2023 at 9:07

Shuai Liu


Community · Accepted Answer · 2020-06-20 09:12:55Z

-2

Try opencsv.

It does everything you need, including (and especially) handling embedded delimiters within quoted values (e.g. "a,b", "c" parses as ["a,b", "c"]).

I've used it successfully and I liked it.

Edited:

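Since opencsv handles only single-character separators, you could work around this as follows:

    String input = "..."; // the raw text to parse
    char someCharNotInInput = '|';
    String delimiter = "abc"; // or whatever
    // replace() takes literal text and returns a new String; replaceAll() expects a regex and does not accept a char
    String preprocessed = input.replace(delimiter, String.valueOf(someCharNotInInput));
    // CSVReader(Reader, char) is the constructor from older opencsv releases;
    // newer releases set the separator through CSVParserBuilder and CSVReaderBuilder
    CSVReader reader = new CSVReader(new StringReader(preprocessed), someCharNotInInput); // etc
    // Put it back into each value read, in case the delimiter appeared inside a quoted value
    value = value.replace(String.valueOf(someCharNotInInput), delimiter);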

edited Jun 20, 2020 at 9:12


answered Dec 28, 2011 at 9:13

Bohemian♦


5 Comments

PNS Over a year ago

OpenCSV is an excellent library, but it only supports single-character separators, not multi-character ones.

PNS Over a year ago

The issue is not handling any form of single-character delimiters (including embedded ones), but handling multi-character delimiters. :-)

PNS Over a year ago

Yes, that's the "preprocessing" step I was talking about in the question, thanks.

Bart Kiers Over a year ago

But such a replacement would not make a distinction between delimiters in- or outside quoted values.

PNS Over a year ago

It won't, but the "restoration" of the original value fixes that. In general, preprocessing is doable but not optimal, which is why I posted the question in the first place.
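To illustrate the point raised in these comments about delimiters inside quoted values, here is a minimal sketch of a splitter that accepts a string delimiter and skips occurrences inside double quotes. It is not from the thread and is far from a full CSV parser: the names are invented, it assumes double-quote quoting with "" as the escaped quote, and it handles one record at a time (no multi-line records, no configurable escape character).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
class QuotedStringDelimiterSplitter {
    // Splits one record on a multi-character delimiter, ignoring delimiters inside quoted fields.
    static List<String> split(String line, String delimiter) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        int i = 0;
        while (i < line.length()) {
            char c = line.charAt(i);
            if (c == '"') {
                if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    field.append('"'); // doubled quote inside a quoted field is a literal quote
                    i += 2;
                } else {
                    inQuotes = !inQuotes; // opening or closing quote, not part of the value
                    i++;
                }
            } else if (!inQuotes && line.startsWith(delimiter, i)) {
                fields.add(field.toString()); // delimiter outside quotes ends the field
                field.setLength(0);
                i += delimiter.length();
            } else {
                field.append(c);
                i++;
            }
        }
        fields.add(field.toString()); // last field has no trailing delimiter
        return fields;
    }

    public static void main(String[] args) {
        // prints [J, Project report, F|||G, 1]
        System.out.println(split("J|||Project report|||\"F|||G\"|||1", "|||"));
    }
}
```

Unlike the replace-and-restore workaround, this needs no character that is guaranteed to be absent from the input, but it also reimplements exactly the part of CSV parsing where a mature library earns its keep.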
