
I'm trying to display all the files in a directory that have the same contents, in a specific format. If a file is unique, it does not need to be displayed. Files that are identical to each other need to be displayed on the same line, separated by commas.

For example,

c176ada8afd5e7c6810816e9dd786c36  2group1
c176ada8afd5e7c6810816e9dd786c36  2group2
e5e6648a85171a4af39bbf878926bef3  4group1
e5e6648a85171a4af39bbf878926bef3  4group2
e5e6648a85171a4af39bbf878926bef3  4group3
e5e6648a85171a4af39bbf878926bef3  4group4
2d43383ddb23f30f955083a429a99452  unique
3925e798b16f51a6e37b714af0d09ceb  unique2

should be displayed as,

2group1, 2group2
4group1, 4group2, 4group3, 4group4

I know which files are unique in a directory from using md5sum, but I do not know how to do the formatting part. I think the solution involves awk or sed, but I am not sure. Any suggestions?

4 Answers


Awk solution (for your current input):

awk '{ a[$1]=a[$1]? a[$1]", "$2:$2 }END{ for(i in a) if(a[i]~/,/) print a[i] }' file

  • a[$1]=a[$1]? a[$1]", "$2:$2 - accumulates the group names (from field $2) for each distinct hash in the 1st field $1. The array a is indexed by hash, with the concatenated group names (separated by a comma ,) as values.

  • for(i in a) - iterates through the array items

  • if(a[i]~/,/) print a[i] - if the hash is associated with more than one group (i.e. the value contains a comma ,), print the item


The output:

2group1, 2group2
4group1, 4group2, 4group3, 4group4
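If the checksums aren't already in a file, the same one-liner can read them straight from md5sum; a sketch, assuming GNU coreutils md5sum output (hash in the first field, file name in the second):

```shell
# Hash every file in the current directory, then group duplicates:
# awk collects the names per hash and prints only the entries that
# contain a comma, i.e. hashes shared by more than one file.
md5sum ./* | awk '{ a[$1] = a[$1] ? a[$1] ", " $2 : $2 }
                  END { for (i in a) if (a[i] ~ /,/) print a[i] }'
```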


Given the input you provided, you essentially want to collect all the second-column values that share the same first column. So the first step is to use awk to hash the second column by the first. I leverage the solution posted here: Concatenate lines by first column by awk or sed

awk '{table[$1]=table[$1] $2 ",";} END {for (key in table) print key " => " table[key];}' file

c176ada8afd5e7c6810816e9dd786c36 => 2group1,2group2,
e5e6648a85171a4af39bbf878926bef3 => 4group1,4group2,4group3,4group4,
3925e798b16f51a6e37b714af0d09ceb => unique2,
2d43383ddb23f30f955083a429a99452 => unique,

And if you really want to filter out the unique ones, just make sure each line has more than two fields when split on ',' (because of the trailing comma, a unique entry yields exactly two fields):

awk '{table[$1]=table[$1] $2 ",";} END {for (key in table) print key " => " table[key];}' file | awk -F ',' 'NF > 2'

c176ada8afd5e7c6810816e9dd786c36 => 2group1,2group2,
e5e6648a85171a4af39bbf878926bef3 => 4group1,4group2,4group3,4group4,
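The two awk passes can also be folded into one by counting the entries per hash; a sketch, not part of the original answer:

```shell
# Count how many files share each hash while accumulating the names,
# then print only hashes seen more than once.
awk '{ cnt[$1]++; table[$1] = table[$1] $2 "," }
     END { for (k in table) if (cnt[k] > 1) print k " => " table[k] }' file
```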



perl:

perl -lane '
        push @{$groups{$F[0]}}, $F[1]
    } END {
        for $g (keys %groups) {
            print join ", ", @{$groups{$g}} if @{$groups{$g}} > 1
        }
' file

The order of the output is indeterminate.



This might work for you (GNU sed):

sed -r 'H;x;s/((\S+)\s+\S+)((\n[^\n]+)*)\n\2\s+(\S+)/\1,\5\3/;x;$!d;x;s/.//;s/^\S+\s*//Mg;s/\n[^,]+$//Mg;s/,/, /g' file

Gather up all the lines of the file, using pattern matching to collapse lines that share a key. At the end of the file, remove the keys and any unique lines, then print the remainder.
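As an alternative sketch (a different technique, not the sed solution above), GNU sort and uniq can group duplicate checksums directly: uniq -w32 compares only the first 32 characters (the MD5 hash), and --all-repeated=separate keeps only repeated lines, with a blank line between groups:

```shell
# Sort by hash, keep only lines whose hash repeats, then join each
# blank-line-separated group's file names on one line. In paragraph
# mode (RS='') every group is one record; the file names sit in the
# even-numbered fields (hash name hash name ...).
md5sum ./* | sort | uniq -w32 --all-repeated=separate |
    awk -v RS='' '{ out = ""
                    for (i = 2; i <= NF; i += 2) out = out (out ? ", " : "") $i
                    print out }'
```

Note that -w and --all-repeated are GNU extensions, so this is not portable to every uniq implementation.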

