
I have a file with three columns, and I need to remove lines in which a specific field value is duplicated.

 1 V(Cl8)                         2.121
 2 V(C1,H3)                       2.067
 3 V(Cl7)                         2.121
 4 V(Cl7)                         1.347
 5 V(C4,H6)                       2.067
 6 V(Cl8)                         1.347
 7 V(Cl8)                         0.918
 8 V(C1,Cl7)                      1.220
 9 V(C4,Cl8)                      1.220
10 V(Cl7)                         0.918
11 V(C1,C4)                       1.958
12 C(Cl8)                         7.668
13 C(Cl7)                         7.668
14 C(C1)                          2.087
15 C(C4)                          2.087
16 C(Cl8)                         2.267
17 C(Cl7)                         2.267
18 V(C1,H2)                       2.067
19 V(Cl8)                         2.122
20 V(Cl7)                         2.122
21 V(C4,H5)                       2.067

I need to remove the lines that contain repeated C(Cl8) and C(Cl7), so that I only have one occurrence of each in the output.

I tried commands like sort and uniq, but all the duplicated strings are removed.

The desired output (note that I don't care which occurrence is kept; I only care that I have just one C(Cl8) and one C(Cl7)):

 1 V(Cl8)                         2.121
 2 V(C1,H3)                       2.067
 3 V(Cl7)                         2.121
 4 V(Cl7)                         1.347
 5 V(C4,H6)                       2.067
 6 V(Cl8)                         1.347
 7 V(Cl8)                         0.918
 8 V(C1,Cl7)                      1.220
 9 V(C4,Cl8)                      1.220
10 V(Cl7)                         0.918
11 V(C1,C4)                       1.958
13 C(Cl7)                         7.668
14 C(C1)                          2.087
15 C(C4)                          2.087
16 C(Cl8)                         2.267
18 V(C1,H2)                       2.067
19 V(Cl8)                         2.122
20 V(Cl7)                         2.122
21 V(C4,H5)                       2.067
  • It's not a number; it's the atomic symbol of chlorine (Cl). Commented Nov 2, 2021 at 10:29
  • Let's see if I understand. Lines 12 and 13 (Cl8, Cl7) have repeated values in the 3rd column. In the output you remove Cl8. Do you want to keep Cl7, or does it not matter which one is removed? Commented Nov 2, 2021 at 10:41
  • Dear schrodigerscatcuriosity, what I'm looking for is to search the second column, find the duplicated C(Cl7) and C(Cl8), and remove those lines. Commented Nov 2, 2021 at 10:44
  • @anasforum How do you decide which lines to keep/remove for values that appear more than once? In your example you keep line 16 and remove line 12 for C(Cl8), but keep line 13 and remove line 17 for C(Cl7). Commented Nov 2, 2021 at 10:57
  • See @DonHolgo's comment above. We need to know how you decide which of the duplicates to remove. Why keep line 16 and not 12? Why keep line 13 but not 17? Commented Nov 2, 2021 at 11:18

2 Answers

If you don't care about which of the duplicates is removed and are OK with keeping the first occurrence and removing the rest, you can use:

$ awk '/C\(Cl8\)/ && ++a > 1{next} /C\(Cl7\)/ && ++b > 1{next}1' file
 1 V(Cl8)                         2.121
 2 V(C1,H3)                       2.067
 3 V(Cl7)                         2.121
 4 V(Cl7)                         1.347
 5 V(C4,H6)                       2.067
 6 V(Cl8)                         1.347
 7 V(Cl8)                         0.918
 8 V(C1,Cl7)                      1.220
 9 V(C4,Cl8)                      1.220
10 V(Cl7)                         0.918
11 V(C1,C4)                       1.958
12 C(Cl8)                         7.668
13 C(Cl7)                         7.668
14 C(C1)                          2.087
15 C(C4)                          2.087
18 V(C1,H2)                       2.067
19 V(Cl8)                         2.122
20 V(Cl7)                         2.122
21 V(C4,H5)                       2.067
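
If the list of labels grows, the same keep-the-first-occurrence idea can be written with one regular expression and an array keyed on the second field. This is only a minimal sketch, assuming the label is always the second whitespace-separated field and that only C(Cl7) and C(Cl8) should be deduplicated. The temporary sample file is a subset of the question's data, created here just for illustration:

```shell
# Build a small sample in a temporary file (a subset of the question's data).
file=$(mktemp)
cat > "$file" <<'EOF'
11 V(C1,C4)                       1.958
12 C(Cl8)                         7.668
13 C(Cl7)                         7.668
14 C(C1)                          2.087
16 C(Cl8)                         2.267
17 C(Cl7)                         2.267
19 V(Cl8)                         2.122
20 V(Cl7)                         2.122
EOF

# $2 ~ /^C\(Cl[78]\)$/ : the second field is exactly C(Cl7) or C(Cl8)
# seen[$2]++           : 0 (false) the first time a label is seen, non-zero afterwards
# A line is skipped only when both hold, i.e. for a repeated C(Cl7)/C(Cl8);
# the trailing 1 prints every other line unchanged and in its original order.
result=$(awk '$2 ~ /^C\(Cl[78]\)$/ && seen[$2]++ {next} 1' "$file")
printf '%s\n' "$result"
rm -f "$file"
```

Because the test is anchored to the second field, lines such as V(Cl8) or V(C4,Cl8) are never candidates for removal, and no sorting is needed.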
  • Where are C(Cl8) and C(Cl7)? They have all disappeared. I'm looking to keep one match. Commented Nov 2, 2021 at 11:30
  • @anasforum eek! Sorry, copy/paste error, see updated answer. Commented Nov 2, 2021 at 11:34
  • @terdon about your editing (not that I'm complaining), I didn't need to add the numerical sorting to have the result shown. Commented Nov 2, 2021 at 11:56
  • @schrodigerscatcuriosity maybe you have your locale set up differently (but that would be odd in this case) but without the -n, the data were sorted alphabetically so 10 was first and 9 was last and 1 was on line 9 etc. Commented Nov 2, 2021 at 12:00
  • Is it possible to search for the duplicates of C(Cl8) and C(Cl7) using a regular expression? Commented Nov 3, 2021 at 16:36

Here's an option:

$ sort -k2,2 file | sed -e 'N;s/^\(.*C(Cl7).*\)\n.*C(Cl7).*/\1/' -e 's/^\(.*C(Cl8).*\)\n.*C(Cl8).*/\1/' | sort -nk1,1
 1 V(Cl8)                         2.121
 2 V(C1,H3)                       2.067
 3 V(Cl7)                         2.121
 4 V(Cl7)                         1.347
 5 V(C4,H6)                       2.067
 6 V(Cl8)                         1.347
 7 V(Cl8)                         0.918
 8 V(C1,Cl7)                      1.220
 9 V(C4,Cl8)                      1.220
10 V(Cl7)                         0.918
11 V(C1,C4)                       1.958
12 C(Cl8)                         7.668
13 C(Cl7)                         7.668
14 C(C1)                          2.087
15 C(C4)                          2.087
# 16 C(Cl8)                         2.267 removed
# 17 C(Cl7)                         2.267 removed
18 V(C1,H2)                       2.067
19 V(Cl8)                         2.122
20 V(Cl7)                         2.122
21 V(C4,H5)                       2.067
  • Thank You All Very Much. Commented Nov 2, 2021 at 11:49