
How do I get a shell script to remove duplicates in a text file, based on the 11th to 21st columns?

Sample file:

Header:0000000000000001457854500000
XP        12345678912yeyeyeyeeye   0000003
XP        12345678913yeyeyeyeeye   0000002
XP        12345678912yeyeyeyeeye   0000004
XP        12345678913yeyeyeyeeye   0000001
Footer:0000000000000001245856500004

Expected output:

Header:0000000000000001457854500000
XP        12345678913yeyeyeyeeye   0000001
XP        12345678912yeyeyeyeeye   0000004
Footer:0000000000000001245856500001
  • Any answer here on the above question? Commented Mar 19, 2019 at 7:49
  • Which filter requirement do you have, exactly? The task is unclear to me. Commented Mar 19, 2019 at 8:00
  • So is it the 11th, the 21st, both, or all columns from the 11th to the 21st? Also, if you only want to keep the last occurrence as you suggest in your comment, then please ask another question with the full requirements -- don't add details in comments and don't edit your question into something else. Commented Mar 19, 2019 at 8:50
  • Why the different footer in the expected output? Why are the two records reversed in the expected output? Header and Footer have the same 11th to 21st characters; shouldn't one of them be removed? If not, how do you differentiate them from the other lines? By looking for the Footer and Header strings? By looking at the number of fields? Commented Mar 19, 2019 at 9:17

2 Answers


Based on your expected output, maybe something like:

awk 'NF <= 1 || !seen[substr($0, 11, 11)]++'

Or

awk 'NF <= 1 || !seen[substr($2, 1, 11)]++'
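
Both one-pass variants follow the same pattern; here it is spelled out with comments, a sketch assuming the sample is saved in a file called file (a name used only for illustration):

awk '
  # the header and footer contain no spaces, so they have a single field:
  # pass such lines through untouched
  NF <= 1 { print; next }
  # the deduplication key is characters 11 through 21 of the record
  # (equivalently, the first 11 characters of field 2 in this data)
  { key = substr($0, 11, 11) }
  # seen[key]++ is 0 the first time a key appears, so the negated test is
  # true (and the record is printed) only for the first record with that key
  !seen[key]++
' file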

Or to keep the last record:

awk '!second_pass {if (NF > 1) count[substr($2, 1, 11)]++; next}
     NF <= 1 || --count[substr($2, 1, 11)] == 0' file second_pass=1 file
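
To make the two-pass version easier to follow, here is the same logic with comments, again assuming the sample is saved as file:

awk '
  # first pass over file: second_pass is still unset, so only count
  # how many records share each key, and print nothing
  !second_pass { if (NF > 1) count[substr($2, 1, 11)]++; next }
  # second pass (second_pass=1): always keep header/footer (NF <= 1),
  # otherwise print a record only when its counter drops to zero,
  # i.e. on the last occurrence of its key
  NF <= 1 || --count[substr($2, 1, 11)] == 0
' file second_pass=1 file

On the sample this keeps the last occurrence of each key, in the order those last occurrences appear in the input:

Header:0000000000000001457854500000
XP        12345678912yeyeyeyeeye   0000004
XP        12345678913yeyeyeyeeye   0000001
Footer:0000000000000001245856500004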
  • Thanks, and could you please explain in detail how it's working? Commented Mar 19, 2019 at 8:24
  • It's working fine, but I want to remove the first record and keep only the last record; right now it's keeping the first record and removing the remaining duplicates. Please assist with this. Commented Mar 19, 2019 at 8:34
  • awk 'NF == 1 || !seen[substr($0, 11, 11)]++' is working as expected, but I want to keep the last record and remove all remaining duplicates. Commented Mar 19, 2019 at 8:54
  • @Pratap quick & dirty workaround: tac your_file | awk 'NF==1 || !seen[substr($0, 11, 11)]++' | tac Commented Mar 19, 2019 at 9:01

Command:

header=`sed -n '1p' l.txt`; footer=`sed -n '$p' l.txt`; sed -e '1d' -e '$d' l.txt | awk '{if (!seen[$2]++) print $0}' | sed '1i '$header'' | sed '$s/.*/&\n'$footer'/g'

Output:

Header:0000000000000001457854500000
XP        12345678912yeyeyeyeeye   0000003
XP        12345678913yeyeyeyeeye   0000002
Footer:0000000000000001245856500004
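
If whitespace in the header or footer ever becomes an issue, the same idea can be written without re-injecting the unquoted variables through sed; a sketch of that variant, assuming the same file name l.txt:

# grab the first and last lines, dedupe the body on field 2, then reassemble
header=$(sed -n '1p' l.txt)
footer=$(sed -n '$p' l.txt)
{
  printf '%s\n' "$header"
  sed -e '1d' -e '$d' l.txt | awk '!seen[$2]++'
  printf '%s\n' "$footer"
}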
