Sorting and removing duplicates from a file in Unix

Question

Below is a sample of my input file; the actual input has millions of records.

004,[email protected],TAT,0582,live,20180622 06:27:47
004,[email protected],TAT,0588,live,20180622 06:27:27
004,[email protected],TAT,0562,live,20180622 06:27:59
004,[email protected],TAT,0582,inlive,20180622 06:27:47
006,[email protected],TAT,0582,live,20180622 06:27:47
004,[email protected],TAT,0582,live,20180622 06:27:47

First, I would like to sort the file above by the second column (email) in ascending order. Second, I want to sort it by the sixth column (timestamp) in descending order. Third, I need to remove duplicates based on the second column.

Expected Output:

004,[email protected],TAT,0582,inlive,20180622 06:27:47
004,[email protected],TAT,0588,live,20180622 06:27:27
006,[email protected],TAT,0582,live,20180622 06:27:47
004,[email protected],TAT,0562,live,20180622 06:27:59
004,[email protected],TAT,0582,live,20180622 06:27:47

This is what I tried, but I want to do it all in a single command instead of separate steps. Also, the duplicate removal with -u wasn't working properly.

sort -t',' -k2 pp.txt > pp1.txt
sort -t',' -k6 -r pp1.txt > pp2.txt
sort -t',' -k2 -u pp2.txt > pp3.txt

Please help

anubhava · Accepted Answer · 2018-06-26 11:39:39Z

2

Using GNU awk you can do this in a single command:

awk -F, 'BEGIN{PROCINFO["sorted_in"] = "@ind_str_asc"}
!($2 in ts) || $6 > ts[$2] { ts[$2]=$6; row[$2]=$0 }
END { for (i in row) print row[i] }' file

004,[email protected],TAT,0582,inlive,20180622 06:27:47
004,[email protected],TAT,0588,live,20180622 06:27:27
006,[email protected],TAT,0582,live,20180622 06:27:47
004,[email protected],TAT,0562,live,20180622 06:27:59
004,[email protected],TAT,0582,live,20180622 06:27:47

The condition !($2 in ts) || $6 > ts[$2] has two sub-conditions joined by OR. The first is true when $2 is not yet a key in the array named ts. The second is true when $2 is already present and the current timestamp ($6) is greater than the one stored in the array. Together they ensure that the final array holds the row with the greatest timestamp for each value of $2.
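PROCINFO["sorted_in"] is a gawk extension, so for other awk implementations the same keep-the-latest logic can be sketched with plain awk piped to sort. This is only an illustration under assumptions: the file name sample.csv and the example.com addresses are made-up placeholders, not the asker's data.

```shell
# Hypothetical sample data; the addresses are placeholders.
cat > sample.csv <<'EOF'
004,bob@example.com,TAT,0582,live,20180622 06:27:47
004,alice@example.com,TAT,0588,live,20180622 06:27:27
004,bob@example.com,TAT,0562,live,20180622 06:27:59
006,alice@example.com,TAT,0582,live,20180622 06:27:10
EOF

# For each email ($2), keep the row with the greatest timestamp ($6).
# Timestamps of the form "YYYYMMDD HH:MM:SS" order correctly as strings.
# awk's "for (k in row)" order is unspecified, so sort by email afterwards.
latest=$(awk -F, '
  !($2 in ts) || $6 > ts[$2] { ts[$2] = $6; row[$2] = $0 }
  END { for (k in row) print row[k] }
' sample.csv | sort -t, -k2,2)

printf '%s\n' "$latest"
```

Each email appears once in the awk output, so the final sort only has to order by the second field.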

edited Jun 26, 2018 at 11:39

answered Jun 26, 2018 at 11:17

anubhava


4 Comments

jcrshankar Over a year ago

Could you please explain this: !($2 in ts) || $6 > ts[$2] { ts[$2]=$6; row[$2]=$0 }

anubhava Over a year ago

!($2 in ts) || $6 > ts[$2] has two conditions joined by OR. The first condition is true if $2 is not present as a key in the array named ts. The second is true if $2 is present and the current timestamp is greater than the one stored in the array (thus allowing us to store the greatest timestamp for each value of $2).

jcrshankar Over a year ago

Could you please tell me which would be faster when handling millions of records: the sort command or GNU awk?

anubhava Over a year ago

sort alone won't give you the output, as you would have to pipe it to awk or uniq. This GNU awk solution is a single command, so I believe it would be faster and more efficient.
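No timings are given in this thread, so the speed question is best settled by measuring on your own data. As a rough sketch, the script below generates a synthetic file (the file name, row count, and example.com addresses are all assumptions) and checks that an awk-first pipeline and a sort-first pipeline agree; prefixing either pipeline with time would then show which is faster on a given machine. Note that the awk approach keeps one row per distinct email in memory, while GNU sort can spill to temporary files.

```shell
# Generate 5000 synthetic rows: 100 placeholder addresses, unique timestamps.
awk 'BEGIN {
  for (i = 1; i <= 5000; i++)
    printf "%03d,user%02d@example.com,TAT,%04d,live,20180622 %02d:%02d:%02d\n",
           i % 7, i % 100, i % 1000, int(i / 3600), int(i % 3600 / 60), i % 60
}' > synthetic.csv

# Approach 1: awk keeps the latest row per email, then sort orders by email.
awk_out=$(awk -F, '
  !($2 in ts) || $6 > ts[$2] { ts[$2] = $6; row[$2] = $0 }
  END { for (k in row) print row[k] }
' synthetic.csv | LC_ALL=C sort -t, -k2,2)

# Approach 2: sort by email, then timestamp descending; awk keeps the first
# row seen for each email.
sort_out=$(LC_ALL=C sort -t, -k2,2 -k6,6r synthetic.csv | awk -F, '!seen[$2]++')

# Both approaches should agree on this data set.
[ "$awk_out" = "$sort_out" ] && echo "outputs match"
```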

user987339 · Accepted Answer · 2018-06-26 10:21:23Z

1

You should do it with this:

sort -t, -u -k2,2 pp.txt

and the result is:

004,[email protected],TAT,0582,inlive,20180622 06:27:47
004,[email protected],TAT,0588,live,20180622 06:27:27
006,[email protected],TAT,0582,live,20180622 06:27:47
004,[email protected],TAT,0582,live,20180622 06:27:47
004,[email protected],TAT,0582,live,20180622 06:27:47

answered Jun 26, 2018 at 10:21

user987339


1 Comment

jcrshankar Over a year ago

Hi, thanks for the reply, but first I would like to sort the file above by the second column (email) in ascending order, then by the sixth column (timestamp) in descending order, and then remove the duplicates.

RavinderSingh13 · Accepted Answer · 2018-06-26 10:45:49Z

0

Could you please try the following and let me know if it helps.

sort -t, -k2,2 -k6,6nr Input_file | awk -F, '!a[$2]++'
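One likely reason the timestamp order is not honored: with the n modifier, sort reads only the leading number of field 6 (the date, 20180622), so rows from the same day tie on that key and the time of day is ignored. A possible variant, sketched here with made-up placeholder data and an assumed file name, drops n so the whole field is compared as a string in reverse:

```shell
# Hypothetical sample data; the addresses are placeholders.
cat > sample.csv <<'EOF'
004,bob@example.com,TAT,0582,live,20180622 06:27:47
004,alice@example.com,TAT,0588,live,20180622 06:27:27
004,bob@example.com,TAT,0562,live,20180622 06:27:59
006,alice@example.com,TAT,0582,live,20180622 06:27:10
EOF

# Sort by email ascending, then by the full timestamp string descending.
# LC_ALL=C forces plain byte-wise comparison, independent of the locale.
# awk then prints only the first (most recent) row for each email.
deduped=$(LC_ALL=C sort -t, -k2,2 -k6,6r sample.csv | awk -F, '!seen[$2]++')

printf '%s\n' "$deduped"
```

String comparison is enough here because every timestamp has the same fixed-width YYYYMMDD HH:MM:SS layout.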

edited Jun 26, 2018 at 10:45

answered Jun 26, 2018 at 10:35

RavinderSingh13


6 Comments

RavinderSingh13 Over a year ago

@jcrshankar, please let me know how your tr result came before shan here, and what output you are expecting; I will try to fix it then.

RavinderSingh13 Over a year ago

@jcrshankar, that is because you haven't copied it correctly: you missed the ' character. Please try the complete command and let me know. I have updated it.

RavinderSingh13 Over a year ago

@jcrshankar, I believe my code is working fine; please check and confirm.

jcrshankar Over a year ago

Hi Ravi, it is all working, but it is not taking the 6th column in descending order. I want the most recent timestamp to be kept and the older timestamps removed as duplicates. I want to retain 004,[email protected],TAT,0562,live,20180622 06:27:59

RavinderSingh13 Over a year ago

@jcrshankar, yes syntax is correct, is it working for you?


agc · Accepted Answer · 2018-06-26 19:36:02Z

0

GNU sort:

sort -t, -k2,2 -k6,6nr -u pp.txt

Output:

004,[email protected],TAT,0582,inlive,20180622 06:27:47
004,[email protected],TAT,0588,live,20180622 06:27:27
006,[email protected],TAT,0582,live,20180622 06:27:47
004,[email protected],TAT,0582,live,20180622 06:27:47
004,[email protected],TAT,0582,live,20180622 06:27:47

answered Jun 26, 2018 at 19:36

agc
