I have a file like the one below, which I have sorted on the UserName field.

UserID score UserName
1234    200   Jack
5678    150   Jill
8543    200   Jill 
5678    100   John

I am trying to remove the rows that have the lowest score for each username, so the output should be:

UserID score UserName
1234    200   Jack
8543    200   Jill 
5678    100   John
  • Are the rows sorted by UserName and score? Or could the two Jill rows be in the opposite order? Commented Mar 13, 2014 at 16:31
  • Those 2 rows are not sorted. The file is sorted only on the username column. Commented Mar 13, 2014 at 16:32
  • Is it an option for you to make the input sorted by both columns? That would make the solution much easier (to make and to understand). Commented Mar 13, 2014 at 16:40
  • What if Jill has 3 rows: do you want to keep the highest one, or remove the lowest one? Commented Mar 13, 2014 at 16:46
  • I just need to keep the highest. Commented Mar 13, 2014 at 16:56

3 Answers

The simplest approach would be to sort on the score field instead:

$ sort -nk2 file | awk '{k[$NF]=$0} END{for (i in k){print k[i]}}'
UserID score UserName
8543    200   Jill 
1234    200   Jack
5678    100   John

Or, in perl:

sort -nk2 file | perl -ane '$k{$F[$#F]}=$_; END{print "$k{$_}" for keys(%k)}'

The -a flag for perl turns on auto-splitting: perl behaves like awk and splits each line on whitespace, saving the fields in the array @F. The -n flag means process the input file line by line.

$F[$#F] is the last element of @F, i.e. the last field: the username. $k{$F[$#F]}=$_; saves each line in the hash %k, keyed by username, overwriting whatever was stored for that username before. Because sort -nk2 orders the lines by ascending score first, the last line stored for each username is the one with the highest score. At the end, we print each line saved in %k; the order in which keys() returns them is not guaranteed.

  • As long as you're sorting, use -r and just print the first occurrence of each name Commented Mar 13, 2014 at 16:58
  • It's difficult to give test input in a comment, so I abused an edit. The task is to delete one line; your code deletes three. Commented Mar 13, 2014 at 16:58
  • @HaukeLaging yes, the OP was not clear, but he actually wants to keep only the highest score (which is what I had understood in the first place), so I rolled back your edit. Commented Mar 13, 2014 at 17:00
  • @Kevin why is that simpler? You would still need to parse to get only one line per username. Am I missing something? Commented Mar 13, 2014 at 17:05
  • @RahulPatil see updated answer. Commented Mar 13, 2014 at 17:41
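The suggestion in the comments (sort in reverse and print the first occurrence of each name) could be sketched roughly as follows. This is only one way to read that comment, not the answer's own code. The function name keep_highest_sorted and the temporary sample file are assumptions for illustration, and the header is handled separately because a reverse numeric sort would otherwise move it to the end.

```shell
# Hypothetical helper: for each username (last field), keep the row
# with the highest score. Assumes the first line of the file is a header.
keep_highest_sorted() {
    head -n 1 "$1"                 # pass the header through unchanged
    tail -n +2 "$1" |
        sort -k2,2nr |             # highest score first
        awk '!seen[$NF]++' |       # keep only the first row seen per username
        sort -b -k3,3              # put the rows back in username order
}

# Sample data from the question, written to a temporary file
sample=$(mktemp)
cat > "$sample" <<'EOF'
UserID score UserName
1234    200   Jack
5678    150   Jill
8543    200   Jill
5678    100   John
EOF

keep_highest_sorted "$sample"
```

The final sort is optional; it is there only so the result comes out ordered by username, like the input.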

Try this:

awk 'NR==1{print $1,$2,$3};NR!=1{if($2>a[$3]){a[$3]=$2;b[$3]=$1}}
    END{for(x in a){print b[x],a[x],x}}' OFS="\t" file

UserID  score   UserName
1234    200     Jack
8543    200     Jill
5678    100     John

Or using perl:

perl -ane '$h{$F[$#F]}=[$F[$#F-1],"$_"] if $F[$#F-1] > $h{$F[$#F]}->[0];
    END{print "$h{$_}->[1]" for keys %h}' file
  • Good awk solution unless Jack or John happen to have a negative score ;-) Also, there is no guarantee that input order will be retained. Commented Mar 13, 2014 at 17:17
  • Yep, if that happens, we need one more check :). Commented Mar 13, 2014 at 17:22
  • Keeping one array storing $0 would be a lot simpler. Commented Mar 13, 2014 at 17:39
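The points raised in these comments (negative scores, input order, storing $0) could be combined along the following lines. This is a sketch under assumptions, not the answer's code: the names keep_highest_awk, best, row and order are invented here, and the first line is assumed to be a header.

```shell
# Hypothetical helper: single-pass awk that keeps, per username (last field),
# the whole row with the highest score, in first-seen username order.
keep_highest_awk() {
    awk '
        NR == 1 { print; next }                    # header line
        !($NF in best) || $2+0 > best[$NF] {       # new name, or a higher score
            if (!($NF in best)) order[++n] = $NF   # remember first-seen order
            best[$NF] = $2+0                       # highest score so far
            row[$NF]  = $0                         # the full line for that score
        }
        END { for (i = 1; i <= n; i++) print row[order[i]] }
    ' "$1"
}

# Sample data from the question, written to a temporary file
scores=$(mktemp)
cat > "$scores" <<'EOF'
UserID score UserName
1234    200   Jack
5678    150   Jill
8543    200   Jill
5678    100   John
EOF

keep_highest_awk "$scores"
```

The explicit `in` test is what makes zero and negative scores work, since a missing array element would otherwise compare as 0.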

An alternative without arrays (it relies on the input being grouped by username, as in the question):

$ awk '
seen == $NF {if ($2+0 >= ishigh+0) {ishigh = $2; line = $0}; next}
line {print line}
{seen = $NF; ishigh = $2; line = $0}
END {print line}' file
UserID score UserName
1234    200   Jack
8543    200   Jill
5678    100   John
