I randomly generated 7 million IDs, which I saved into 7 different CSV files because of the large size, so I now have 7 CSV files with 1 million IDs each. I am trying to check for duplicate IDs across all 7 CSV files. Is there any way this can be done in Java?
1 Answer
The simplest way to do it in Java is to load all 7 million IDs into memory. You can put them in a Set, and for each new ID you read from a file, check whether it already exists in the Set. I'm assuming you would then have to write the output files without the duplicates.
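A minimal sketch of the Set approach, assuming each CSV file holds one ID per line with no header row (the question does not say how the files are laid out). The class name DuplicateIdFinder and the method names are made up for illustration. It relies on Set.add returning false when the element is already in the Set, so a separate contains() check is not needed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class DuplicateIdFinder {

    // Remembers one ID; if it was seen before, records it as a duplicate.
    // Assumption: one ID per line, so the whole trimmed line is the ID.
    static void record(String line, Set<String> seen, Set<String> duplicates) {
        String id = line.trim();
        if (id.isEmpty()) {
            return; // skip blank lines
        }
        // Set.add returns false when the element is already present.
        if (!seen.add(id)) {
            duplicates.add(id);
        }
    }

    // Reads every file line by line and returns the IDs that occur more than once.
    static Set<String> findDuplicates(List<Path> files) throws IOException {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new TreeSet<>(); // sorted, for readable output
        for (Path file : files) {
            try (BufferedReader reader = Files.newBufferedReader(file)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    record(line, seen, duplicates);
                }
            }
        }
        return duplicates;
    }

    public static void main(String[] args) throws IOException {
        // File names come from the command line.
        List<Path> files = new ArrayList<>();
        for (String name : args) {
            files.add(Paths.get(name));
        }
        Set<String> duplicates = findDuplicates(files);
        System.out.println(duplicates.size() + " duplicated ID(s)");
        for (String id : duplicates) {
            System.out.println(id);
        }
    }
}
```

It could be run as java DuplicateIdFinder ids1.csv ids2.csv and so on (placeholder file names). Holding 7 million strings in a HashSet can take a few hundred megabytes, so the JVM heap may need to be raised with -Xmx depending on how long the IDs are.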
I wouldn't do it with Java. A simple Unix/Linux shell script would do the trick (cat file1 file2 file3 file4 file5 file6 file7 | sort | uniq would give you all the unique IDs), and then you can split them back into 7 files if you have to.
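If the work has to stay in Java, the same sort, de-duplicate and split pipeline can be roughly sketched as below. This is only an illustration: the class name UniqueIdSplitter and the method uniqueChunks are made up, and reading the input files and writing the chunks back out are left out to keep it short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

public class UniqueIdSplitter {

    // Sorts the IDs and drops repeats (the job of sort | uniq), then cuts the
    // result into at most `parts` consecutive chunks of near-equal size.
    static List<List<String>> uniqueChunks(Collection<String> ids, int parts) {
        // A TreeSet keeps its elements sorted and silently ignores repeats.
        List<String> unique = new ArrayList<>(new TreeSet<>(ids));
        // Ceiling division, so a remainder ends up in a shorter last chunk.
        int chunkSize = Math.max(1, (unique.size() + parts - 1) / parts);
        List<List<String>> chunks = new ArrayList<>();
        for (int start = 0; start < unique.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, unique.size());
            chunks.add(new ArrayList<>(unique.subList(start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Tiny made-up sample standing in for the IDs read from the files.
        List<String> ids = Arrays.asList("b7", "a1", "b7", "c3", "a1", "d9");
        System.out.println(uniqueChunks(ids, 2)); // prints [[a1, b7], [c3, d9]]
    }
}
```

Each chunk could then be written to its own file, for example with Files.write(path, chunk), which accepts a list of lines.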
1 Comment
Hajo
Thanks for the response. I think I need to use Java because I need to do more than check for duplicates afterwards. I am actually a newbie. Could you please explain further how to go about loading the IDs into a Set and checking for the duplicates?