2

I have a CSV file in this format:

"id-1"|"A"
"id-2"|"C"
"id-1"|"B"
"id-1"|"D"
"id-2"|"B"
"id-3"|"A"
"id-3"|"A"
"id-1"|"B"

I want to group by the unique IDs in the first column and concatenate the types into a single row, like this:

"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"

I found that awk handles such scenarios well, but all I could achieve is this:

"id-1"|"A":"B":"D":"B"
"id-2"|"B":"C"
"id-3"|"A":"A"

I used this command:

awk -F "|" '{if(a[$1])a[$1]=a[$1]":"$2; else a[$1]=$2;}END{for (i in a)print i, a[i];}' OFS="|" file

How can I remove the duplicates and also handle the formatting of the second column types?

  • Why is it "id-2"|"B:C" instead of "id-2"|"C:B" in the output when the C value comes first? Commented Oct 12, 2017 at 13:57
  • @anubhava I am also looking for a sorted result list. Commented Oct 12, 2017 at 17:07
  • @Qedrix just be aware that any awk solution using the in operator (e.g. for (i in array)) will not produce sorted output unless it's gawk and sets sorted_in. If the output looks like it IS sorted, that's pure coincidence with your specific data set, and you can be sure it will not be with other input. Commented Oct 12, 2017 at 18:38

5 Answers

2

Quick fix:

$ awk -F "|" '!seen[$0]++{if(a[$1])a[$1]=a[$1]":"$2; else a[$1]=$2;}END{for (i in a)print i, a[i];}' OFS="|" file 
"id-1"|"A":"B":"D"
"id-2"|"C":"B"
"id-3"|"A"
  • !seen[$0]++ is true only the first time a given line is seen, so repeated lines are skipped
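
To see the idiom in isolation, here is a minimal sketch that prints only the first occurrence of each line. The wrapper name dedupe_lines is hypothetical and used only for illustration.

```shell
# dedupe_lines (hypothetical name): print each distinct input line once,
# in first-seen order.
# seen[$0] is 0 (false) the first time a line appears, so !seen[$0]++ is true
# exactly once per distinct line; the post-increment then marks it as seen.
dedupe_lines() {
    awk '!seen[$0]++' "$@"
}

# The repeated "id-3" line is printed only once.
printf '%s\n' '"id-3"|"A"' '"id-3"|"A"' '"id-1"|"B"' | dedupe_lines
```

With no file arguments the function reads standard input, so it can sit in the middle of a pipeline.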


If the whole second column should be within a single pair of double quotes:

$ awk -v dq='"' 'BEGIN{FS=OFS="|"}
                 !seen[$0]++{a[$1]=a[$1] ? a[$1]":"$2 : $2}
                 END{for (i in a){gsub(dq,"",a[i]); print i, dq a[i] dq}}' file
"id-1"|"A:B:D"
"id-2"|"C:B"
"id-3"|"A"

2 Comments

Shouldn't it be "B:C" in the 2nd line?
@RomanPerekhrest I am not sorting it; hopefully the OP will clarify whether that is a requirement.
2

With GNU awk for true multi-dimensional arrays and gensub() and sorted_in:

$ awk -F'|' '
    { a[$1][gensub(/"/,"","g",$2)] }
    END {
        PROCINFO["sorted_in"] = "@ind_str_asc"
        for (i in a) {
            c = 0
            for (j in a[i]) {
                printf "%s%s", (c++ ? ":" : i "|\""), j
            }
            print "\""
        }
    }
' file
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"

The output rows and columns will both be string-sorted (i.e. alphabetically by characters) in ascending order.

3 Comments

This is the best answer here, though I don't understand how it got B:C instead of C:B since C appeared first in the input.
Pure coincidence. The in operator visits the array indices in hash order (or other randomness but that's the usual...) so the output could've been in any order.
@anubhava The OP just added a comment that she wanted the output sorted, so I've added the sorted_in statement to take care of that.
1

Short GNU datamash + tr solution:

datamash -st'|' -g1 unique 2 <file | tr ',' ':'

The output:

"id-1"|"A":"B":"D"
"id-2"|"B":"C"
"id-3"|"A"

----------

If the double quotes between items should be removed, use the following alternative:

datamash -st'|' -g1 unique 2 <file | sed 's/","/:/g'

The output:

"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"


0

For the sample input, the commands below work, but the output is unsorted.

One-liner

# using two arrays (recommended)
awk 'BEGIN{FS=OFS="|"}!seen[$1,$2]++{a[$1] = ($1 in a ? a[$1] ":" : "") $2}END{for(i in a)print i,a[i]}' infile

# using regexp 
awk 'BEGIN{FS=OFS="|"}{ a[$1] = $1 in a ? ( a[$1] ~ ("(^|:)"$2"(:|$)") ? a[$1] :  a[$1]":"$2  ) : $2}END{for(i in a)print i,a[i]}' infile

Test Results:

$ cat infile
"id-1"|"A"
"id-2"|"C"
"id-1"|"B"
"id-1"|"D"
"id-2"|"B"
"id-3"|"A"
"id-3"|"A"
"id-1"|"B"

$ awk 'BEGIN{FS=OFS="|"}!seen[$1,$2]++{a[$1] = ($1 in a ? a[$1] ":" : "") $2}END{for(i in a)print i,a[i]}' infile
"id-1"|"A":"B":"D"
"id-2"|"C":"B"
"id-3"|"A"    

$ awk 'BEGIN{FS=OFS="|"}{ a[$1] = $1 in a ? ( a[$1] ~ ("(^|:)"$2"(:|$)") ? a[$1] :  a[$1]":"$2  ) : $2}END{for(i in a)print i,a[i]}' infile
"id-1"|"A":"B":"D"
"id-2"|"C":"B"
"id-3"|"A"

More readable:

Using regexp

awk 'BEGIN {
         FS = OFS = "|"
     }
     {
         # append $2 only if it is not already in the colon-separated list
         a[$1] = $1 in a ? (a[$1] ~ ("(^|:)" $2 "(:|$)") ? a[$1] : a[$1] ":" $2) : $2
     }
     END {
         for (i in a)
             print i, a[i]
     }' infile
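
One caveat with the regexp variant: $2 is interpolated into a dynamic regexp, so values containing regexp metacharacters (such as . or *) could match the wrong entry. A minimal sketch of a literal-string alternative uses index() on a delimiter-wrapped list. The name group_literal is hypothetical, and the sketch assumes that : never occurs inside a value.

```shell
# group_literal (hypothetical name): like the regexp variant, but tests
# membership with a literal substring search instead of a dynamic regexp.
# Assumption: ":" never appears inside a value, so ":" item ":" is unambiguous.
group_literal() {
    awk 'BEGIN { FS = OFS = "|" }
         {
             if (!($1 in a))
                 a[$1] = $2                 # first value for this id
             else if (!index(":" a[$1] ":", ":" $2 ":"))
                 a[$1] = a[$1] ":" $2       # append only values not yet listed
         }
         END { for (i in a) print i, a[i] }' "$@"
}

# As a regexp, A.B would also match AxB; index() compares literally.
printf '%s\n' 'k|AxB' 'k|A.B' 'k|AxB' | group_literal    # k|AxB:A.B
```

As with the other for (i in a) solutions, the order of the output rows is not guaranteed when there is more than one id.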

Using two arrays

awk 'BEGIN {
         FS = OFS = "|"
     }
     !seen[$1,$2]++ {
         a[$1] = ($1 in a ? a[$1] ":" : "") $2
     }
     END {
         for (i in a)
             print i, a[i]
     }' infile

Note: you can also use !seen[$0]++, which uses the entire line as the index. If your real data has more columns and only some of them should decide what counts as a duplicate, prefer !seen[$1,$2]++, which uses only column 1 and column 2 as the index.


0

awk + sort solution (the <(...) process substitution needs bash, ksh or zsh):

awk -F'|' '{ gsub(/"/,"",$2); a[$1]=b[$1]++? a[$1]":"$2:$2 }
           END{ for(i in a) printf "%s|\"%s\"\n",i,a[i] }' <(sort -u file)

The output:

"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"

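Since for (i in a) gives no ordering guarantee, the row order shown above is not guaranteed by awk itself. Below is a sketch of a variant that relies only on the order produced by sort -u and prints each group as soon as the id changes. The wrapper name group_sorted is hypothetical, and LC_ALL=C is an added assumption to get plain byte-wise ordering.

```shell
# group_sorted (hypothetical name): group types per id without any array traversal.
# LC_ALL=C sort -u removes duplicate lines and orders them byte-wise, so all
# lines of one id are adjacent and their types arrive already sorted.
group_sorted() {
    LC_ALL=C sort -u "$@" | awk -F'|' '
        { gsub(/"/, "", $2) }                 # strip the quotes from the type
        NR == 1 || $1 != prev {               # a new id starts here
            if (NR > 1) printf "\"\n"         # close the previous row
            printf "%s|\"%s", $1, $2
            prev = $1
            next
        }
        { printf ":%s", $2 }                  # same id: append the type
        END { if (NR > 0) printf "\"\n" }     # close the last row
    '
}

printf '%s\n' '"id-2"|"C"' '"id-1"|"B"' '"id-2"|"B"' '"id-1"|"B"' | group_sorted
```

Because the rows are emitted while streaming, this version also avoids holding every group in memory, and it works with a plain POSIX sh since no process substitution is involved.
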
