Remove duplicate from csv using bash / awk

Question

I have a CSV file in this format:

"id-1"|"A"
"id-2"|"C"
"id-1"|"B"
"id-1"|"D"
"id-2"|"B"
"id-3"|"A"
"id-3"|"A"
"id-1"|"B"

I want to group by the unique ids in the first column and concatenate the types into a single row, like this:

"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"

I found awk does a great job in handling such scenarios. But all I could achieve is this:

"id-1"|"A":"B":"D":"B"
"id-2"|"B":"C"
"id-3"|"A":"A"

I used this command:

awk -F "|" '{if(a[$1])a[$1]=a[$1]":"$2; else a[$1]=$2;}END{for (i in a)print i, a[i];}' OFS="|" file

How can I remove the duplicates and also handle the formatting of the second column types?

Why "id-2"|"B:C" instead of "id-2"|"C:B" in the output, when the C value comes first?
– anubhava, Commented Oct 12, 2017 at 13:57
@Qedrix Just be aware that any awk solution using the in operator (e.g. for (i in array)) is not guaranteed to produce sorted output unless it is gawk and sets sorted_in. If the output looks sorted, that is pure coincidence with your specific data set, and you can be sure it will not be with other input.
– Ed Morton, Commented Oct 12, 2017 at 18:38

Sundeep · Accepted Answer · 2017-10-12 14:10:11Z

2

quick fix:

$ awk -F "|" '!seen[$0]++{if(a[$1])a[$1]=a[$1]":"$2; else a[$1]=$2;}END{for (i in a)print i, a[i];}' OFS="|" file 
"id-1"|"A":"B":"D"
"id-2"|"C":"B"
"id-3"|"A"

!seen[$0]++ is true only the first time a given line is seen, so repeated lines are skipped.
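The idiom can be tried on its own. The following is a minimal sketch, not part of the original answer; the function name dedupe_lines is hypothetical and only wraps the same pattern.

```shell
# dedupe_lines (hypothetical name): print each distinct input line once,
# keeping the order in which lines first appear.
dedupe_lines() {
    # seen[$0]++ yields the previous count (0 on first sight),
    # so !seen[$0]++ is true exactly once per distinct line.
    awk '!seen[$0]++'
}

# The repeated "id-3" row is printed only once.
printf '%s\n' '"id-3"|"A"' '"id-3"|"A"' '"id-1"|"B"' | dedupe_lines
```

Because the pattern has no action, awk falls back to its default action and prints the line.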

If the second column should be enclosed in a single pair of double quotes:

$ awk -v dq='"' 'BEGIN{FS=OFS="|"}
                 !seen[$0]++{a[$1]=a[$1] ? a[$1]":"$2 : $2}
                 END{for (i in a){gsub(dq,"",a[i]); print i, dq a[i] dq}}' file
"id-1"|"A:B:D"
"id-2"|"C:B"
"id-3"|"A"

edited Oct 12, 2017 at 14:10

answered Oct 12, 2017 at 13:55


2 Comments

RomanPerekhrest Over a year ago

Shouldn't it be "B:C" in the 2nd line?

Sundeep Over a year ago

@RomanPerekhrest I am not sorting it, hopefully OP will clarify if that is a requirement

Ed Morton · Accepted Answer · 2017-10-12 18:35:02Z

2

With GNU awk for true multi-dimensional arrays and gensub() and sorted_in:

$ awk -F'|' '
    { a[$1][gensub(/"/,"","g",$2)] }
    END {
        PROCINFO["sorted_in"] = "@ind_str_asc"
        for (i in a) {
            c = 0
            for (j in a[i]) {
                printf "%s%s", (c++ ? ":" : i "|\""), j
            }
            print "\""
        }
    }
' file
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"

The output rows and columns will both be string-sorted (i.e. alphabetically by characters) in ascending order.
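Where GNU awk is not available, similar sorted output can be approximated with sort plus a streaming awk pass that never uses for (i in a). This is a sketch of a different technique from the answer above, under the assumptions that the input looks like the sample and that byte-wise (C locale) ordering is acceptable; the function name group_sorted is hypothetical.

```shell
# group_sorted FILE (hypothetical helper): one row per id, with both
# rows and types in ascending byte order, without gawk-only features.
group_sorted() {
    # sort -u drops duplicate lines and makes rows with the same id adjacent
    LC_ALL=C sort -u "$1" | awk -F'|' '
        { gsub(/"/, "", $2) }              # strip the quotes around the type
        NR == 1 || $1 != prev {            # first row of a new id
            if (NR > 1) print line "\""    # finish the previous id
            line = $1 "|\"" $2
            prev = $1
            next
        }
        { line = line ":" $2 }             # same id: append the type
        END { if (NR > 0) print line "\"" }
    '
}

# Usage with the sample data from the question
tmp=$(mktemp)
printf '%s\n' '"id-1"|"A"' '"id-2"|"C"' '"id-1"|"B"' '"id-1"|"D"' \
              '"id-2"|"B"' '"id-3"|"A"' '"id-3"|"A"' '"id-1"|"B"' > "$tmp"
group_sorted "$tmp"
rm -f "$tmp"
```

For the sample data this prints "id-1"|"A:B:D", "id-2"|"B:C" and "id-3"|"A", one per line. Output order comes from sort rather than from awk's array traversal, so it does not depend on the awk implementation.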

edited Oct 12, 2017 at 18:35

answered Oct 12, 2017 at 15:06


3 Comments

anubhava Over a year ago

This is best answer here. Though I don't understand how it got B:C instead of C:B since C appeared first in input.

Ed Morton Over a year ago

Pure coincidence. The in operator visits the array indices in hash order (or other randomness but that's the usual...) so the output could've been in any order.

Ed Morton Over a year ago

@anubhava The OP just added a comment that she wanted the output sorted, so I've added the sorted_in statement to take care of that.

RomanPerekhrest · Accepted Answer · 2017-10-12 16:08:17Z

1

Short GNU datamash + tr solution:

datamash -st'|' -g1 unique 2 <file | tr ',' ':'

The output:

"id-1"|"A":"B":"D"
"id-2"|"B":"C"
"id-3"|"A"

----------

If the double quotes between items should be removed, use the following alternative:

datamash -st'|' -g1 unique 2 <file | sed 's/","/:/g'

The output:

"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"

edited Oct 12, 2017 at 16:08

answered Oct 12, 2017 at 14:00


Akshay Hegde · Accepted Answer · 2017-10-12 14:31:06Z

For the sample input, the commands below will work, but the output is unsorted.

One-liner

# using two arrays (recommended)
awk 'BEGIN{FS=OFS="|"}!seen[$1,$2]++{a[$1] = ($1 in a ? a[$1] ":" : "") $2}END{for(i in a)print i,a[i]}' infile

# using regexp 
awk 'BEGIN{FS=OFS="|"}{ a[$1] = $1 in a ? ( a[$1] ~ ("(^|:)"$2"(:|$)") ? a[$1] :  a[$1]":"$2  ) : $2}END{for(i in a)print i,a[i]}' infile

Test Results:

$ cat infile
"id-1"|"A"
"id-2"|"C"
"id-1"|"B"
"id-1"|"D"
"id-2"|"B"
"id-3"|"A"
"id-3"|"A"
"id-1"|"B"

$ awk 'BEGIN{FS=OFS="|"}!seen[$1,$2]++{a[$1] = ($1 in a ? a[$1] ":" : "") $2}END{for(i in a)print i,a[i]}' infile
"id-1"|"A":"B":"D"
"id-2"|"C":"B"
"id-3"|"A"

$ awk 'BEGIN{FS=OFS="|"}{ a[$1] = $1 in a ? ( a[$1] ~ ("(^|:)"$2"(:|$)") ? a[$1] :  a[$1]":"$2  ) : $2}END{for(i in a)print i,a[i]}' infile
"id-1"|"A":"B":"D"
"id-2"|"C":"B"
"id-3"|"A"

More readable:

Using regexp

awk 'BEGIN{
           FS=OFS="|"
     }
     { 
           a[$1] = $1 in a ? (a[$1] ~ ("(^|:)"$2"(:|$)") ? a[$1] : a[$1]":"$2) : $2
     }
     END{
           for(i in a)
              print i,a[i]
     }
     ' infile
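One caveat with the regexp variant: $2 is interpreted as a regular expression, so a value containing metacharacters such as "." may match a different, already stored value and be dropped. A possible workaround, sketched here as an assumption rather than as part of the original answer, is a fixed-string membership test with index(); the function name group_fixed is hypothetical.

```shell
# group_fixed (hypothetical name): same grouping as the regexp variant,
# but membership is tested as a literal string, not as a regular expression.
group_fixed() {
    awk 'BEGIN{ FS=OFS="|" }
         {
             if (!($1 in a)) {
                 a[$1] = $2
             } else if (index(":" a[$1] ":", ":" $2 ":") == 0) {
                 # the delimited value is not in the delimited list yet: append it
                 a[$1] = a[$1] ":" $2
             }
         }
         END{ for (i in a) print i, a[i] }'
}

# With the regexp test, "a.c" would match the stored "abc" and be skipped.
printf '%s\n' '"id-1"|"abc"' '"id-1"|"a.c"' '"id-1"|"abc"' | group_fixed
```

Wrapping both the stored list and the candidate in ":" keeps the test from matching only part of a longer value.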

Using two arrays

awk 'BEGIN{
           FS=OFS="|"
     }
     !seen[$1,$2]++{
           a[$1] = ($1 in a ? a[$1] ":" : "") $2
     }
     END{
           for(i in a)
              print i,a[i]
     }
     ' infile

Note: you can also use !seen[$0]++, which uses the entire line as the index. If your real data has more columns and duplicates should be decided by specific columns only, prefer !seen[$1,$2]++, which uses column 1 and column 2 as the index.

RomanPerekhrest · Accepted Answer · 2017-10-12 15:28:10Z

0

awk + sort solution:

awk -F'|' '{ gsub(/"/,"",$2); a[$1]=b[$1]++? a[$1]":"$2:$2 }
           END{ for(i in a) printf "%s|\"%s\"\n",i,a[i] }' <(sort -u file)

The output:

"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"

edited Oct 12, 2017 at 15:28

answered Oct 12, 2017 at 13:55

