checking for duplicate rows in csv files - java

Question

I randomly generated 7 million IDs, which I saved into 7 different CSV files because of the large size, so I now have 7 CSV files with 1 million IDs each. I am trying to check for duplicate IDs across all 7 CSV files. Is there any way this can be done in Java?

Eran · Accepted Answer · 2014-07-20 11:02:18Z

1

The straightforward way to do it in Java is to load all 7 million IDs into memory. You can put them in a Set and, for each new ID you load from a file, check whether it already exists in the Set. I'm assuming you would then have to write the output files without the duplicates.
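A minimal sketch of the Set-based check, for illustration only. It assumes each file holds one ID per line with no header row and is UTF-8 encoded; the class name DuplicateIdChecker and the method findDuplicates are hypothetical names, and the file paths are taken from the command line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DuplicateIdChecker {

    /**
     * Reads every file line by line and returns the IDs seen more than once
     * across all files. Assumes one ID per line and no header row.
     */
    public static Set<String> findDuplicates(List<Path> files) throws IOException {
        Set<String> seen = new HashSet<>();             // every ID read so far
        Set<String> duplicates = new LinkedHashSet<>(); // repeated IDs, in order of detection

        for (Path file : files) {
            try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String id = line.trim();
                    if (id.isEmpty()) {
                        continue; // skip blank lines
                    }
                    // Set.add returns false when the ID is already present
                    if (!seen.add(id)) {
                        duplicates.add(id);
                    }
                }
            }
        }
        return duplicates;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java DuplicateIdChecker ids1.csv ids2.csv ... ids7.csv
        List<Path> files = new ArrayList<>();
        for (String name : args) {
            files.add(Paths.get(name));
        }
        Set<String> duplicates = findDuplicates(files);
        System.out.println("Duplicate IDs found: " + duplicates.size());
        for (String id : duplicates) {
            System.out.println(id);
        }
    }
}
```

Holding 7 million strings in a HashSet can take several hundred megabytes depending on ID length, so the JVM heap may need to be raised with -Xmx. After the loop, the seen set contains each ID exactly once, so it could also serve as the source for writing the de-duplicated output files.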

I wouldn't do it with Java. A simple Unix/Linux shell script would do the trick: cat file1 file2 file3 file4 file5 file6 file7 | sort | uniq would give you all the unique IDs, and then you can split them back into 7 files if you have to.


1 Comment

Hajo Over a year ago

Thanks for the response. I think I need to use Java because I need to do more than check for duplicates afterwards. I am actually a newbie. Could you please explain further how to go about loading the IDs into a Set and checking for duplicates?
