How to remove a specific duplicate string from a field in a text file?

Question

I have a file with three columns, and I need to remove lines that contain a specific duplicated field.

 1 V(Cl8)                         2.121
 2 V(C1,H3)                       2.067
 3 V(Cl7)                         2.121
 4 V(Cl7)                         1.347
 5 V(C4,H6)                       2.067
 6 V(Cl8)                         1.347
 7 V(Cl8)                         0.918
 8 V(C1,Cl7)                      1.220
 9 V(C4,Cl8)                      1.220
10 V(Cl7)                         0.918
11 V(C1,C4)                       1.958
12 C(Cl8)                         7.668
13 C(Cl7)                         7.668
14 C(C1)                          2.087
15 C(C4)                          2.087
16 C(Cl8)                         2.267
17 C(Cl7)                         2.267
18 V(C1,H2)                       2.067
19 V(Cl8)                         2.122
20 V(Cl7)                         2.122
21 V(C4,H5)                       2.067

I need to remove the lines that contain repeated C(Cl8) and C(Cl7), so that I only have one occurrence of each in the output.

I tried commands like sort and uniq, but all the duplicated strings are removed.

The desired output (note that I don't care which occurrence is kept, I only care that I have just one C(Cl8) and one C(Cl7)):

 1 V(Cl8)                         2.121
 2 V(C1,H3)                       2.067
 3 V(Cl7)                         2.121
 4 V(Cl7)                         1.347
 5 V(C4,H6)                       2.067
 6 V(Cl8)                         1.347
 7 V(Cl8)                         0.918
 8 V(C1,Cl7)                      1.220
 9 V(C4,Cl8)                      1.220
10 V(Cl7)                         0.918
11 V(C1,C4)                       1.958
13 C(Cl7)                         7.668
14 C(C1)                          2.087
15 C(C4)                          2.087
16 C(Cl8)                         2.267
18 V(C1,H2)                       2.067
19 V(Cl8)                         2.122
20 V(Cl7)                         2.122
21 V(C4,H5)                       2.067

Let's see if I understand. So lines 12 and 13 (Cl8, Cl7) have repeated values in the 3rd column. In the output you remove Cl8. Do you want to keep Cl7, or does it not matter which one is removed?
– schrodingerscatcuriosity, Commented Nov 2, 2021 at 10:41
Dear schrodingerscatcuriosity, what I'm looking for is to search the second column, find the duplicate C(Cl7) and C(Cl8), and remove the line.
– anas forum, Commented Nov 2, 2021 at 10:44
@anasforum How do you decide which lines to keep/remove for values that appear more than once? In your example you keep line 16 and remove line 12 for C(Cl8), but keep line 13 and remove line 17 for C(Cl7).
– DonHolgo, Commented Nov 2, 2021 at 10:57
See @DonHolgo's comment above. We need to know how you decide which of the duplicates to remove. Why keep line 16 and not 12? Why keep line 13 but not 17?
– schrodingerscatcuriosity, Commented Nov 2, 2021 at 11:18

terdon · Accepted Answer · 2021-11-02 11:33:53Z


If you don't care about which of the duplicates is removed and are OK with keeping the first occurrence and removing the rest, you can use:

$ awk '/C\(Cl8\)/ && ++a > 1{next} /C\(Cl7\)/ && ++b > 1{next}1' file | color -l 'C\(Cl7\)','C\(Cl8\)'
 1 V(Cl8)                         2.121
 2 V(C1,H3)                       2.067
 3 V(Cl7)                         2.121
 4 V(Cl7)                         1.347
 5 V(C4,H6)                       2.067
 6 V(Cl8)                         1.347
 7 V(Cl8)                         0.918
 8 V(C1,Cl7)                      1.220
 9 V(C4,Cl8)                      1.220
10 V(Cl7)                         0.918
11 V(C1,C4)                       1.958
12 C(Cl8)                         7.668
13 C(Cl7)                         7.668
14 C(C1)                          2.087
15 C(C4)                          2.087
18 V(C1,H2)                       2.067
19 V(Cl8)                         2.122
20 V(Cl7)                         2.122
21 V(C4,H5)                       2.067
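
A variant of the same idea, in case it is useful: instead of one counter per pattern, a single regular expression can be matched against the second field, with an array counting how often each value has been seen. This is only a sketch. The temporary file and the array name `seen` are my own choices, and the data is a shortened excerpt of the file from the question.

```shell
# Shortened excerpt of the question's data, written to a temporary file
# (the temp file is an assumption, used so the example is self-contained).
sample=$(mktemp)
cat > "$sample" <<'EOF'
 6 V(Cl8)                         1.347
 7 V(Cl8)                         0.918
12 C(Cl8)                         7.668
13 C(Cl7)                         7.668
14 C(C1)                          2.087
16 C(Cl8)                         2.267
17 C(Cl7)                         2.267
19 V(Cl8)                         2.122
EOF

# Skip a line only when field 2 is exactly C(Cl7) or C(Cl8) AND that value
# has already been seen; seen[$2]++ evaluates to 0 (false) the first time.
# The trailing 1 prints every line that was not skipped.
result=$(awk '$2 ~ /^C\(Cl[78]\)$/ && seen[$2]++ { next } 1' "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

Because the regular expression is anchored to the whole second field, the repeated V(Cl8) lines are left alone and only the later C(Cl8) and C(Cl7) lines are dropped.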

edited Nov 2, 2021 at 11:33

answered Nov 2, 2021 at 11:21


Where are C(Cl8) and C(Cl7)? They have all disappeared. I'm looking to keep one match.
– anas forum, Commented Nov 2, 2021 at 11:30
@anasforum eek! Sorry, copy/paste error, see updated answer.
– terdon, Commented Nov 2, 2021 at 11:34
@terdon about your editing (not that I'm complaining), I didn't need to add the numerical sorting to have the result shown.
– schrodingerscatcuriosity, Commented Nov 2, 2021 at 11:56
@schrodingerscatcuriosity maybe you have your locale set up differently (but that would be odd in this case), but without the -n the data were sorted alphabetically, so 10 was first and 9 was last and 1 was on line 9, etc.
– terdon, Commented Nov 2, 2021 at 12:00
Is it possible to search for the duplicates of C(Cl8) and C(Cl7) using a regular expression?
– anas forum, Commented Nov 3, 2021 at 16:36

terdon · Accepted Answer · 2021-11-02 11:45:24Z

Here's an option:

$ sort -k2,2 file | sed -e 'N;s/^\(.*C(Cl7).*\)\n.*C(Cl7).*/\1/' -e 's/^\(.*C(Cl8).*\)\n.*C(Cl8).*/\1/' | sort -nk1,1
 1 V(Cl8)                         2.121
 2 V(C1,H3)                       2.067
 3 V(Cl7)                         2.121
 4 V(Cl7)                         1.347
 5 V(C4,H6)                       2.067
 6 V(Cl8)                         1.347
 7 V(Cl8)                         0.918
 8 V(C1,Cl7)                      1.220
 9 V(C4,Cl8)                      1.220
10 V(Cl7)                         0.918
11 V(C1,C4)                       1.958
12 C(Cl8)                         7.668
13 C(Cl7)                         7.668
14 C(C1)                          2.087
15 C(C4)                          2.087
# 16 C(Cl8)                         2.267 removed
# 17 C(Cl7)                         2.267 removed
18 V(C1,H2)                       2.067
19 V(Cl8)                         2.122
20 V(Cl7)                         2.122
21 V(C4,H5)                       2.067
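
One caveat with the `N` approach: sed joins lines in fixed pairs (1 and 2, 3 and 4, and so on), so the two duplicates are merged only if they happen to land in the same pair after sorting. A sketch that avoids this by comparing each line's second field with the previous line's is shown here. The temporary file, the variable name `prev`, the `LC_ALL=C` setting and the `-b` flag are my own choices, and the data is a shortened excerpt of the question's file.

```shell
# Shortened excerpt of the question's data, written to a temporary file
# (the temp file is an assumption, used so the example is self-contained).
sample=$(mktemp)
cat > "$sample" <<'EOF'
 6 V(Cl8)                         1.347
 7 V(Cl8)                         0.918
12 C(Cl8)                         7.668
13 C(Cl7)                         7.668
14 C(C1)                          2.087
16 C(Cl8)                         2.267
17 C(Cl7)                         2.267
19 V(Cl8)                         2.122
EOF

# 1. Sort on field 2 so that equal values become adjacent (C locale for
#    byte-wise, predictable ordering; -b ignores leading blanks in the key).
# 2. Drop a line when its field 2 equals the previous line's field 2 and
#    is exactly C(Cl7) or C(Cl8); remember field 2 of every kept line.
# 3. Restore the original order by sorting numerically on field 1.
result=$(LC_ALL=C sort -b -k2,2 "$sample" |
  awk '$2 == prev && $2 ~ /^C\(Cl[78]\)$/ { next } { prev = $2 } 1' |
  sort -n -k1,1)
printf '%s\n' "$result"
rm -f "$sample"
```

This keeps one line per targeted value regardless of where the duplicates fall in the sorted stream, and it still leaves the repeated V(Cl8) lines untouched.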

Thank You All Very Much.
– anas forum, Commented Nov 2, 2021 at 11:49
