I will be combining a number of CSV files. What I am trying to do is to:

1) Remove duplicate rows from the file. However, I need to check multiple columns as the criteria for what constitutes a duplicate. How do I do that?

2) It would be nice to then create a second output file showing what was removed, in case something was removed that was not supposed to be.

3) Create a list of items in an input file to act as a rule: if a row contains one of these words in a particular column, then remove the entire row.

If someone could help me with the commands to do this, that would be great! Please let me know if I need to clarify.

Here is a sample of what the data looks like (an example, as suggested):

I have a CSV file like this:

column1    column2

john       kerry
adam       stephenson
ashley     hudson
john       kerry
etc.

For question 1, I want to remove duplicates from this file to get only:

column1    column2

john       kerry
adam       stephenson
ashley     hudson

For question 3, I want to take the second list (meaning the output of the first list) and scrub it further. I want a file like input.txt that contains:

adam

Then, the final output will be:

column1    column2

john       kerry
ashley     hudson

So, the input.txt file in the example contains the word adam (this way I can make a long list of words to check in the input.txt file). For #3, I need a code snippet that will check column 1 of all lines of the CSV for all the words in the input file, then remove any matching rows from the CSV.

  • Perl and awk are well-suited for this kind of work. You will probably get better answers if you provide an example input file and show what you've already tried. Commented Aug 19, 2014 at 21:38
  • Just added. I know nothing about Perl or Awk, but I do know Bash. I hope someone can give me Bash commands. Commented Aug 19, 2014 at 21:50
  • Just to make sure: you check some columns for equality, which means some columns may differ but are still counted as equal (in other words: uniq does not work)? Commented Aug 19, 2014 at 21:59
  • I agree with tkocmathla... You might want to do this using something more suited than Bash for this kind of work. And I'd like to add Python to the list. Commented Aug 19, 2014 at 22:09
  • I would take your CSV file and put it into an SQLite database or something. If you're trying to compare equality based on some combination of columns, you're going to find that VERY hard in Bash. Commented Aug 19, 2014 at 23:32

1 Answer

You need to provide more details for question 3, but for questions 1 and 2 the following awk one-liner will work.

awk 'seen[$0]++{print $0 > "dups.csv"; next}{print $0 > "new.csv"}' mycsv

And with some whitespace added for clarity:

awk 'seen[$0]++ {
  print $0 > "dups.csv"; next
}
{
  print $0 > "new.csv"
}' mycsv

This will not print anything to STDOUT but will create two files: dups.csv will contain the duplicates that were removed (that is, if there are 5 entries of the same line, this file will contain the 4 entries removed as dups), and new.csv will contain all unique rows.

seen[$0]++ is a test we run for each line. Because of the post-increment, the expression is 0 (false) the first time a line is encountered and nonzero (true) on every later occurrence. If the line is already present in our array, it is written to the dups.csv file and we move on to the next line using next. If the line is not present, we add it to the array and write it to the new.csv file.

Use of $0 means the entire line. If you want to test only certain columns, you can do so; you just need to set the input field separator based on your delimiter. You mentioned CSV but I don't see any comma delimiters, so I am using the default separator, which splits on runs of whitespace ([[:space:]]+).
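
For illustration (assuming the four sample data rows from the question are saved, without the header, in a file named mycsv), a run would look like this:

$ cat mycsv
john       kerry
adam       stephenson
ashley     hudson
john       kerry
$ awk 'seen[$0]++{print $0 > "dups.csv"; next}{print $0 > "new.csv"}' mycsv
$ cat new.csv
john       kerry
adam       stephenson
ashley     hudson
$ cat dups.csv
john       kerry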

The asker followed up in the comments: "Also, it is comma separated, I was just putting sample data up. So, if I want to use the above example but want to test only columns 3 & 4 (using the seen command), how would I do that in a comma separated file?"

For a true CSV, just set the field separator to ,. seen is not a command; it is a hash whose keys are the chosen columns. So you would modify the above command to:

awk -F, 'seen[$3,$4]++{print $0 > "dups.csv"; next}{print $0 > "new.csv"}' mycsv
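
To see the effect, here is a sketch with made-up four-column rows (the data below is hypothetical, only for illustration): two rows that differ in columns 1 and 2 but match in columns 3 and 4 count as duplicates.

$ cat mycsv
bob,smith,john,kerry
sue,jones,john,kerry
adam,stephenson,ashley,hudson
$ awk -F, 'seen[$3,$4]++{print $0 > "dups.csv"; next}{print $0 > "new.csv"}' mycsv
$ cat new.csv
bob,smith,john,kerry
adam,stephenson,ashley,hudson
$ cat dups.csv
sue,jones,john,kerry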

Update:

Once you have a list without dups using the commands stated above, we are left with:

$ cat new.csv 
john,kerry
adam,stephenson
ashley,hudson

$ cat remove.txt 
adam

$ awk -F, 'NR==FNR{remove[$1]++;next}!($1 in remove)' remove.txt new.csv 
john,kerry
ashley,hudson
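
This works because NR==FNR is only true while awk reads the first file, so remove.txt is loaded into an array first; each row of new.csv is then printed only if its first column is not a key in that array. The same command spread out with comments:

awk -F, '
NR==FNR {          # true only while reading the first file (remove.txt)
  remove[$1]++     # remember each word to be removed
  next             # do not apply the rule below to remove.txt lines
}
!($1 in remove)    # for new.csv: print only rows whose column 1 is not listed
' remove.txt new.csv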

Comments

The default separator is [[:space:]]+, isn't it?
@TomFenech Yeah, I was intending to write sequences of [:space:], but [[:space:]]+ is shorter. Will update, thanks!
Thanks! I just updated for question 3. Also, it is comma separated, I was just putting sample data up. So, if I want to use the above example but want to test only columns 3 & 4 (using the seen command), how would I do that in a comma separated file?
@Peaceful_Warrior Not sure if I follow. What's the algorithm to create a file that just contains adam? Will update the post to answer your other question.
@Peaceful_Warrior Thanks, but you still haven't mentioned why the list will only contain adam and not john or ashley.
