
I'm trying to display all the files in a directory that have the same contents, in a specific format. If a file is unique, it does not need to be displayed. Files that are identical to each other need to be displayed on the same line, separated by commas.

For example,

c176ada8afd5e7c6810816e9dd786c36  2group1
c176ada8afd5e7c6810816e9dd786c36  2group2
e5e6648a85171a4af39bbf878926bef3  4group1
e5e6648a85171a4af39bbf878926bef3  4group2
e5e6648a85171a4af39bbf878926bef3  4group3
e5e6648a85171a4af39bbf878926bef3  4group4
2d43383ddb23f30f955083a429a99452  unique
3925e798b16f51a6e37b714af0d09ceb  unique2

should be displayed as,

2group1, 2group2
4group1, 4group2, 4group3, 4group4

I know which files are unique in a directory from using md5sum, but I do not know how to do the formatting part. I think the solution involves awk or sed, but I am not sure. Any suggestions?

4 Answers


Awk solution (for your current input):

awk '{ a[$1]=a[$1]? a[$1]", "$2:$2 }END{ for(i in a) if(a[i]~/,/) print a[i] }' file

  • a[$1]=a[$1]? a[$1]", "$2:$2 - accumulates the group names (from field $2) for each distinct hash in the 1st field $1. The array a is indexed by hash, with the concatenated group names (separated by a comma ,) as values.

  • for(i in a) - iterates through the array items

  • if(a[i]~/,/) print a[i] - if the hash is associated with more than one group (i.e. the value contains a comma ,), print the item


The output:

2group1, 2group2
4group1, 4group2, 4group3, 4group4
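If the checksums aren't already in a file, the same one-liner can read them straight from md5sum; a sketch, assuming GNU coreutils md5sum output (hash in the first field, file name in the second):

```shell
# Hash every file in the current directory, then group duplicates:
# awk collects the names per hash and prints only the entries that
# contain a comma, i.e. hashes shared by more than one file.
md5sum ./* | awk '{ a[$1] = a[$1] ? a[$1] ", " $2 : $2 }
                  END { for (i in a) if (a[i] ~ /,/) print a[i] }'
```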


Given the input you provided, you essentially want to collect all the second-column values that share the same first column. So the first step is to use awk to hash the second column by the first. I leverage the solution posted here: Concatenate lines by first column by awk or sed

awk '{table[$1]=table[$1] $2 ",";} END {for (key in table) print key " => " table[key];}' file

c176ada8afd5e7c6810816e9dd786c36 => 2group1,2group2,
e5e6648a85171a4af39bbf878926bef3 => 4group1,4group2,4group3,4group4,
3925e798b16f51a6e37b714af0d09ceb => unique2,
2d43383ddb23f30f955083a429a99452 => unique,

And if you really want to filter out the unique ones, just make sure each line has more than two fields when split on ',' (because of the trailing comma, a unique entry yields exactly two fields):

awk '{table[$1]=table[$1] $2 ",";} END {for (key in table) print key " => " table[key];}' file | awk -F ',' 'NF > 2'

c176ada8afd5e7c6810816e9dd786c36 => 2group1,2group2,
e5e6648a85171a4af39bbf878926bef3 => 4group1,4group2,4group3,4group4,
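The two awk passes can also be folded into one by counting the entries per hash; a sketch, not part of the original answer:

```shell
# Count how many files share each hash while accumulating the names,
# then print only hashes seen more than once.
awk '{ cnt[$1]++; table[$1] = table[$1] $2 "," }
     END { for (k in table) if (cnt[k] > 1) print k " => " table[k] }' file
```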



perl:

perl -lane '
        push @{$groups{$F[0]}}, $F[1]
    } END {
        for $g (keys %groups) {
            print join ", ", @{$groups{$g}} if @{$groups{$g}} > 1
        }
' file

The order of the output is indeterminate.



This might work for you (GNU sed):

sed -r 'H;x;s/((\S+)\s+\S+)((\n[^\n]+)*)\n\2\s+(\S+)/\1,\5\3/;x;$!d;x;s/.//;s/^\S+\s*//Mg;s/\n[^,]+$//Mg;s/,/, /g' file

Gather up all the lines of the file, using pattern matching to collapse lines that share a key. At the end of the file, remove the keys and any unique lines, then print the remainder.
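As an alternative sketch (a different technique, not the sed solution above), GNU sort and uniq can group duplicate checksums directly: uniq -w32 compares only the first 32 characters (the MD5 hash), and --all-repeated=separate keeps only repeated lines, with a blank line between groups:

```shell
# Sort by hash, keep only lines whose hash repeats, then join each
# blank-line-separated group's file names on one line. In paragraph
# mode (RS='') every group is one record; the file names sit in the
# even-numbered fields (hash name hash name ...).
md5sum ./* | sort | uniq -w32 --all-repeated=separate |
    awk -v RS='' '{ out = ""
                    for (i = 2; i <= NF; i += 2) out = out (out ? ", " : "") $i
                    print out }'
```

Note that -w and --all-repeated are GNU extensions, so this is not portable to every uniq implementation.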

