
How do I get a shell script to remove duplicates in a text file, based on the 11th to 21st columns?

Sample file:

Header:0000000000000001457854500000
XP        12345678912yeyeyeyeeye   0000003
XP        12345678913yeyeyeyeeye   0000002
XP        12345678912yeyeyeyeeye   0000004
XP        12345678913yeyeyeyeeye   0000001
Footer:0000000000000001245856500004

Expected output:

Header:0000000000000001457854500000
XP        12345678913yeyeyeyeeye   0000001
XP        12345678912yeyeyeyeeye   0000004
Footer:0000000000000001245856500001
  • Any answer here on the above question? Commented Mar 19, 2019 at 7:49
  • Which filter requirement do you have, exactly? The task is unclear to me. Commented Mar 19, 2019 at 8:00
  • So is it the 11th, the 21st, both, or all columns from the 11th to the 21st? Also, if you only want to keep the last occurrence as you suggest in your comment, then please ask another question with the full requirements -- don't add details in comments and don't edit your question into something else. Commented Mar 19, 2019 at 8:50
  • Why the different footer in the expected output? Why are the two records reversed in the expected output? Header and Footer have the same 11th to 21st characters; shouldn't one of them be removed? If not, how do you differentiate them from the other lines? By looking for the Footer and Header strings? By looking at the number of fields? Commented Mar 19, 2019 at 9:17

2 Answers


Based on your expected output, maybe something like:

awk 'NF <= 1 || !seen[substr($0, 11, 11)]++'

Or

awk 'NF <= 1 || !seen[substr($2, 1, 11)]++'
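
Both one-pass variants follow the same pattern; here it is spelled out with comments, a sketch assuming the sample is saved in a file called file (a name used only for illustration):

awk '
  # the header and footer contain no spaces, so they have a single field:
  # pass such lines through untouched
  NF <= 1 { print; next }
  # the deduplication key is characters 11 through 21 of the record
  # (equivalently, the first 11 characters of field 2 in this data)
  { key = substr($0, 11, 11) }
  # seen[key]++ is 0 the first time a key appears, so the negated test is
  # true (and the record is printed) only for the first record with that key
  !seen[key]++
' file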

Or to keep the last record:

awk '!second_pass {if (NF > 1) count[substr($2, 1, 11)]++; next}
     NF <= 1 || --count[substr($2, 1, 11)] == 0' file second_pass=1 file
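
To make the two-pass version easier to follow, here is the same logic with comments, again assuming the sample is saved as file:

awk '
  # first pass over file: second_pass is still unset, so only count
  # how many records share each key, and print nothing
  !second_pass { if (NF > 1) count[substr($2, 1, 11)]++; next }
  # second pass (second_pass=1): always keep header/footer (NF <= 1),
  # otherwise print a record only when its counter drops to zero,
  # i.e. on the last occurrence of its key
  NF <= 1 || --count[substr($2, 1, 11)] == 0
' file second_pass=1 file

On the sample this keeps the last occurrence of each key, in the order those last occurrences appear in the input:

Header:0000000000000001457854500000
XP        12345678912yeyeyeyeeye   0000004
XP        12345678913yeyeyeyeeye   0000001
Footer:0000000000000001245856500004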
  • Thanks, and could you please explain in detail how it's working? Commented Mar 19, 2019 at 8:24
  • It's working fine, but I want to remove the first record and keep only the last record; right now it's keeping the first record and removing the remaining duplicates. Please assist with this. Commented Mar 19, 2019 at 8:34
  • awk 'NF == 1 || !seen[substr($0, 11, 11)]++' is working as expected, but I want to keep the last record and remove all remaining duplicates. Commented Mar 19, 2019 at 8:54
  • @Pratap quick & dirty workaround: tac your_file | awk 'NF==1 || !seen[substr($0, 11, 11)]++' | tac Commented Mar 19, 2019 at 9:01

Command:

header=`sed -n '1p' l.txt`; footer=`sed -n '$p' l.txt`; sed -e '1d' -e '$d' l.txt | awk '{if (!seen[$2]++) print $0}' | sed '1i '$header'' | sed '$s/.*/&\n'$footer'/g'

Output:

Header:0000000000000001457854500000
XP        12345678912yeyeyeyeeye   0000003
XP        12345678913yeyeyeyeeye   0000002
Footer:0000000000000001245856500004
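
If whitespace in the header or footer ever becomes an issue, the same idea can be written without re-injecting the unquoted variables through sed; a sketch of that variant, assuming the same file name l.txt:

# grab the first and last lines, dedupe the body on field 2, then reassemble
header=$(sed -n '1p' l.txt)
footer=$(sed -n '$p' l.txt)
{
  printf '%s\n' "$header"
  sed -e '1d' -e '$d' l.txt | awk '!seen[$2]++'
  printf '%s\n' "$footer"
}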
