awk remove the rows based on another column value

Question

I have a file like below which I have sorted based on the username field.

UserID score UserName
1234    200   Jack
5678    150   Jill
8543    200   Jill 
5678    100   John

I am trying to remove the rows which have the lowest score for the same usernames. So, I want to have the output as,

UserID score UserName
1234    200   Jack
8543    200   Jill 
5678    100   John

Are the rows sorted by UserName and score? Or could the two Jill rows be in the opposite order?
– Hauke Laging, Commented Mar 13, 2014 at 16:31
Those 2 rows are not sorted. It is sorted only based on the username column.
– Ramesh, Commented Mar 13, 2014 at 16:32
Is it an option for you to make the input sorted by both columns? That would make the solution much easier (to make and to understand).
– Hauke Laging, Commented Mar 13, 2014 at 16:40
What if Jill has 3 rows: do you want to keep the one highest, or remove the one lowest?
– glenn jackman, Commented Mar 13, 2014 at 16:46

terdon · Accepted Answer · 2014-03-13 17:41:08Z

4

The simplest approach would be to sort on the score field instead:

$ sort -nk2 file | awk '{k[$NF]=$0} END{for (i in k){print k[i]}}'
UserID score UserName
8543    200   Jill 
1234    200   Jack
5678    100   John

Or, in perl:

sort -nk2 file | perl -ane '$k{$F[$#F]}=$_; END{print "$k{$_}" for keys(%k)}'

The -a flag turns on auto-splitting: perl behaves like awk and splits each line on whitespace, saving the fields in the array @F. The -n flag means the input is processed line by line.

$F[$#F] is the last element of @F, so the last field: the username. $k{$F[$#F]}=$_; saves each line in the hash %k, keyed by username, overwriting whatever was stored there before. Since the file is first sorted by ascending score, the last line stored for each username is the one with the highest score. At the end, we print each line saved in %k.
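
One caveat with the pipelines above: the header is treated like any other row (it only survives because "score" is not a number and "UserName" is a unique key), and the order in which awk's for (i in k) visits keys is unspecified. A minimal sketch that sets the header aside and sorts the result by name, assuming a one-line header; the file name scores.txt and the function name keep_highest are made up for illustration:

```shell
# Sample input from the question; scores.txt is a hypothetical file name.
cat > scores.txt <<'EOF'
UserID score UserName
1234    200   Jack
5678    150   Jill
8543    200   Jill
5678    100   John
EOF

# keep_highest is an illustrative helper, not part of the original answer.
keep_highest() {
    # Print the header unchanged.
    head -n 1 "$1"
    # Data rows only: sort ascending by score, let the last line stored
    # per name win, then sort by name for a deterministic order.
    tail -n +2 "$1" |
        sort -k2,2n |
        awk '{k[$NF] = $0} END {for (i in k) print k[i]}' |
        sort -b -k3,3
}

result=$(keep_highest scores.txt)
printf '%s\n' "$result"
```

The final sort is only there to make the output order predictable; drop it if the order does not matter.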

edited Mar 13, 2014 at 17:41

answered Mar 13, 2014 at 16:43

terdon♦


As long as you're sorting, use -r and just print the first occurrence of each name.
– Kevin, Commented Mar 13, 2014 at 16:58
It's difficult to give test input in a comment, so I abused an edit. The task is to delete one line; your code deletes three.
– Hauke Laging, Commented Mar 13, 2014 at 16:58
@HaukeLaging yes, the OP was not clear, but he actually wants to keep only the highest score (which is what I had understood in the first place), so I rolled back your edit.
– terdon ♦, Commented Mar 13, 2014 at 17:00
@Kevin why is that simpler? You would still need to parse to get only one line per username. Am I missing something?
– terdon ♦, Commented Mar 13, 2014 at 17:05
@RahulPatil see updated answer.
– terdon ♦, Commented Mar 13, 2014 at 17:41
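
For reference, Kevin's suggestion could be sketched roughly as follows: sort by descending score, then let awk print only the first line it sees for each name. The awk step is still needed, as terdon points out, but it no longer has to buffer anything until END. This assumes a one-line header; scores.txt and first_per_name are hypothetical names used only for this sketch.

```shell
# Sample input from the question; scores.txt is a hypothetical file name.
cat > scores.txt <<'EOF'
UserID score UserName
1234    200   Jack
5678    150   Jill
8543    200   Jill
5678    100   John
EOF

# first_per_name is an illustrative helper name.
first_per_name() {
    # Keep the header out of the sort.
    head -n 1 "$1"
    # Highest score first; !seen[$NF]++ is true only the first time
    # a given name (last field) appears, so lower scores are skipped.
    tail -n +2 "$1" | sort -k2,2nr | awk '!seen[$NF]++'
}

result=$(first_per_name scores.txt)
printf '%s\n' "$result"
```

The rows come out ordered by score rather than by name, so add a final sort if the original ordering matters.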


cuonglm · Accepted Answer · 2015-07-14 03:16:36Z

4

Try this:

awk 'NR==1{print $1,$2,$3};NR!=1{if($2>a[$3]){a[$3]=$2;b[$3]=$1}}
    END{for(x in a){print b[x],a[x],x}}' OFS="\t" file

UserID  score   UserName
1234    200     Jack
8543    200     Jill
5678    100     John

Or using perl:

perl -ane '$h{$F[$#F]}=[$F[$#F-1],"$_"] if $F[$#F-1] > $h{$F[$#F]}->[0];
    END{print "$h{$_}->[1]" for keys %h}' file

edited Jul 14, 2015 at 3:16

answered Mar 13, 2014 at 16:48

cuonglm


Good awk solution unless Jack or John happen to have a negative score ;-) Also no guarantees that input order will be retained
– iruvar, Commented Mar 13, 2014 at 17:17
Yeap, if that happened, we need one more check :).
– cuonglm, Commented Mar 13, 2014 at 17:22
keeping one array storing $0 would be a lot simpler
– glenn jackman, Commented Mar 13, 2014 at 17:39
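
A sketch in the spirit of these comments, not the answer author's code: it stores the whole line ($0) per name, uses an explicit "first time seen" check so negative scores work, and records the order in which names first appear so the input order is retained. The names scores.txt, keep_best, best, row and order are all made up for illustration.

```shell
# Sample input from the question; scores.txt is a hypothetical file name.
cat > scores.txt <<'EOF'
UserID score UserName
1234    200   Jack
5678    150   Jill
8543    200   Jill
5678    100   John
EOF

# keep_best is an illustrative helper name.
keep_best() {
    awk '
    NR == 1 { print; next }              # pass the header through
    !($NF in best) {                     # first row for this name
        order[++n] = $NF                 # remember first-appearance order
        best[$NF] = $2 + 0; row[$NF] = $0
        next
    }
    $2 + 0 > best[$NF] {                 # strictly higher score: replace
        best[$NF] = $2 + 0; row[$NF] = $0
    }
    END { for (i = 1; i <= n; i++) print row[order[i]] }
    ' "$1"
}

result=$(keep_best scores.txt)
printf '%s\n' "$result"
```

Because the check is "name not yet seen" rather than "score greater than an empty value", a user whose scores are all negative still gets a row in the output.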


jaypal singh · Accepted Answer · 2014-03-14 05:31:34Z

3

An alternate without arrays:

$ awk '
seen == $NF {
    # same name as the previous row: keep the highest score seen so far
    if ($2 >= ishigh) {ishigh = $2; line = $0}
    next
}
line {print line}
{seen = $NF; ishigh = $2; line = $0}
END {print line}' file
UserID score UserName
1234    200   Jack
8543    200   Jill
5678    100   John

answered Mar 14, 2014 at 5:31

jaypal singh

