1

Below is a sample of my input file; the actual input has millions of records:

004,[email protected],TAT,0582,live,20180622 06:27:47
004,[email protected],TAT,0588,live,20180622 06:27:27
004,[email protected],TAT,0562,live,20180622 06:27:59
004,[email protected],TAT,0582,inlive,20180622 06:27:47
006,[email protected],TAT,0582,live,20180622 06:27:47
004,[email protected],TAT,0582,live,20180622 06:27:47

First, I would like to sort the file above by the second column (email) in ascending order. Second, I want to sort by the 6th column (timestamp) in descending order. Third, I need to remove duplicates based on the second column.

Expected Output:

004,[email protected],TAT,0582,inlive,20180622 06:27:47
004,[email protected],TAT,0588,live,20180622 06:27:27
006,[email protected],TAT,0582,live,20180622 06:27:47
004,[email protected],TAT,0562,live,20180622 06:27:59
004,[email protected],TAT,0582,live,20180622 06:27:47

This is what I tried, but I would like to do it all in a single command instead of separate steps. Also, duplicate removal with -u was not working properly.

sort -t$',' -k2 pp.txt > pp1.txt
sort -t$',' -k6 -r pp1.txt > pp2.txt
sort -t$',' -k2 -u pp2.txt > pp3.txt

Please help

4 Answers

2

Using GNU awk, you can do this in a single command:

awk -F, 'BEGIN{PROCINFO["sorted_in"] = "@ind_str_asc"}
!($2 in ts) || $6 > ts[$2] { ts[$2]=$6; row[$2]=$0 }
END { for (i in row) print row[i] }' file

004,[email protected],TAT,0582,inlive,20180622 06:27:47
004,[email protected],TAT,0588,live,20180622 06:27:27
006,[email protected],TAT,0582,live,20180622 06:27:47
004,[email protected],TAT,0562,live,20180622 06:27:59
004,[email protected],TAT,0582,live,20180622 06:27:47

The condition !($2 in ts) || $6 > ts[$2] has two sub-conditions joined by OR. The first is true when $2 is not yet a key in the array ts. The second is true when $2 is already present and the current timestamp ($6) is greater than the one stored in ts[$2]. Together they keep the row with the greatest timestamp for each value of $2 in the final array row.
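
Since PROCINFO["sorted_in"] is specific to GNU awk, here is a minimal sketch of the same idea for a plain POSIX awk, with the email ordering done by a trailing sort. The email addresses and the temporary file are hypothetical placeholders, not values from the question, and the sketch assumes fixed-width timestamps so that a string comparison orders them correctly.

```shell
# Hypothetical sample data; the addresses are placeholders.
tmpdir=$(mktemp -d)
cat > "$tmpdir/pp.txt" <<'EOF'
004,b@example.com,TAT,0582,live,20180622 06:27:47
004,a@example.com,TAT,0588,live,20180622 06:27:27
004,b@example.com,TAT,0562,live,20180622 06:27:59
006,c@example.com,TAT,0582,live,20180622 06:27:47
EOF

# Keep, per email ($2), the row with the largest timestamp ($6).
# "for (k in row)" iterates in an unspecified order in POSIX awk,
# so the final sort restores ascending email order.
result=$(awk -F, '
  !($2 in ts) || $6 > ts[$2] { ts[$2] = $6; row[$2] = $0 }
  END { for (k in row) print row[k] }
' "$tmpdir/pp.txt" | sort -t, -k2,2)

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

The awk stage holds one row per distinct email in memory, so the trailing sort only sees the already deduplicated rows.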

4 Comments

Could you please explain this: !($2 in ts) || $6 > ts[$2] { ts[$2]=$6; row[$2]=$0 }
!($2 in ts) || $6 > ts[$2] has two conditions joined by OR. The first is true when $2 is not yet a key in the array ts. The second is true when $2 is present and the current timestamp is greater than the stored one, which lets us keep the greatest timestamp for each value of $2.
Could you please tell me which would be faster when handling millions of records: the sort command or GNU awk?
sort alone won't give you this output; you would have to pipe it to awk or uniq. The GNU awk solution is a single command, so I believe it would be faster and more efficient.
1

You should do it with this:

sort -t, -u -k2,2 pp.txt

and result is:

004,[email protected],TAT,0582,inlive,20180622 06:27:47
004,[email protected],TAT,0588,live,20180622 06:27:27
006,[email protected],TAT,0582,live,20180622 06:27:47
004,[email protected],TAT,0582,live,20180622 06:27:47
004,[email protected],TAT,0582,live,20180622 06:27:47

1 Comment

Hi, thanks for the reply, but I first want to sort the file by the second column (email) in ascending order, then by the 6th column (timestamp) in descending order, and only then remove the duplicates.
0

Could you please try the following and let me know if it helps?

sort -t, -k2,2 -k6,6nr Input_file | awk -F, '!a[$2]++'

6 Comments

@jcrshankar, could you tell me how your tr result came before shan here, and what output you expect? I will try to fix it then.
@jcrshankar, that is because you did not copy it correctly; you missed a '. I have updated the answer, so please try the complete command and let me know.
@jcrshankar, I believe my code is working fine; please check and confirm.
Hi ravi, everything works except that it is not sorting the 6th column in descending order. For duplicates, I want the most recent timestamp kept and the older ones removed, so I want to retain 004,[email protected],TAT,0562,live,20180622 06:27:59
@jcrshankar, yes, the syntax is correct. Is it working for you?
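
A likely cause of the behaviour described in the comment above: with the n modifier, sort compares only the leading numeric part of the key, which for "20180622 06:27:47" is the date 20180622, so the time of day does not take part in the comparison. Because the timestamps are fixed-width, a plain reversed string comparison (r without n) should order them correctly. The sketch below assumes that format; the email addresses are hypothetical placeholders, and LC_ALL=C is an added assumption to force byte-wise comparison.

```shell
# Hypothetical sample data; the addresses are placeholders.
tmpdir=$(mktemp -d)
cat > "$tmpdir/pp.txt" <<'EOF'
004,b@example.com,TAT,0582,live,20180622 06:27:47
004,a@example.com,TAT,0588,live,20180622 06:27:27
004,b@example.com,TAT,0599,live,20180622 06:27:59
006,c@example.com,TAT,0582,live,20180622 06:27:47
EOF

# Key 1: email ascending. Key 2: full timestamp as a string, descending.
# awk then prints only the first row seen for each email, i.e. the latest one.
latest=$(LC_ALL=C sort -t, -k2,2 -k6,6r "$tmpdir/pp.txt" | awk -F, '!seen[$2]++')

printf '%s\n' "$latest"
rm -rf "$tmpdir"
```

For b@example.com the 06:27:59 row sorts ahead of the 06:27:47 row, so it is the one awk keeps.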
0

GNU sort:

sort -t, -k2,2 -k6,6nr -u pp.txt

Output:

004,[email protected],TAT,0582,inlive,20180622 06:27:47
004,[email protected],TAT,0588,live,20180622 06:27:27
006,[email protected],TAT,0582,live,20180622 06:27:47
004,[email protected],TAT,0582,live,20180622 06:27:47
004,[email protected],TAT,0582,live,20180622 06:27:47
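
One point worth checking when combining -u with more than one key: sort treats two lines as duplicates only when every key compares equal, so -u by itself does not reduce the output to one row per email. The sketch below illustrates this on hypothetical data (the addresses are placeholders) and uses a string key for the full timestamp rather than the numeric key shown above, which is an assumption made for the illustration.

```shell
# Hypothetical sample data; the addresses are placeholders.
tmpdir=$(mktemp -d)
cat > "$tmpdir/pp.txt" <<'EOF'
004,b@example.com,TAT,0582,live,20180622 06:27:47
004,a@example.com,TAT,0588,live,20180622 06:27:27
004,b@example.com,TAT,0599,live,20180622 06:27:59
004,b@example.com,TAT,0582,inlive,20180622 06:27:47
006,c@example.com,TAT,0582,live,20180622 06:27:47
EOF

# -u drops a line only when BOTH the email and the timestamp match another line,
# so the two b@example.com rows with different times both survive.
with_u=$(LC_ALL=C sort -t, -k2,2 -k6,6r -u "$tmpdir/pp.txt" | awk 'END { print NR }')

# Filtering on the email alone afterwards leaves one row per address.
per_email=$(LC_ALL=C sort -t, -k2,2 -k6,6r -u "$tmpdir/pp.txt" \
  | awk -F, '!seen[$2]++' | awk 'END { print NR }')

echo "rows after sort -u: $with_u, rows after per-email filter: $per_email"
rm -rf "$tmpdir"
```

Of the five input rows, only the one that repeats both an email and a timestamp is removed by -u; the awk filter is what enforces uniqueness on the email column.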
