Remove nearly duplicate lines

Question

I've got a knotty problem that I can't figure out how to solve.

I have a text file containing a few million lines of text. Basically I want to run uniq, but with a twist: If two lines are identical but for a :FOO suffix, drop the line that lacks the suffix. But only if the lines are otherwise identical. And only for :FOO, not any other possible suffix. For example, in the sample below I do not want to drop yellow.9:FOO, because the line above it isn't identical.

red.7
green.2
green.2:FOO
blue.6
yellow.9:FOO

I want to delete green.2, because the line below is identical but with a suffix. All other lines should be retained unchanged.

[Edit: I forgot to mention, the file is already in sorted order.]

My thoughts so far:

Obviously uniq is the tool to do this.
You can make uniq ignore a prefix, but never a suffix. (This is extremely annoying!)
I thought perhaps you could pretend that : is a field separator, and get cut (together with paste) to flip the field order. But no, it is apparently impossible to force cut to output a blank line if no separator is present.
My next thought is to go through line by line and output a 1-character prefix depending on the presence or absence of the suffix... but I can't imagine scripting that as a Bash loop being performant.

Any hints?

I may end up just using a real programming language to fix this. It seems simple enough to fix in Bash, but I've already wasted quite a lot of time failing to get it to work...

Do you want to keep the version with :FOO, the one without it or either? Can you have identical lines that don't have :FOO and, if so, what should be done with those?
– terdon ♦, Commented May 6, 2016 at 12:49
@terdon There shouldn't be any lines which are exactly identical. [I should probably go check that though... if there are, something has gone horribly wrong!] I want to keep the longer line with the :FOO suffix.
– MathematicalOrchid, Commented May 6, 2016 at 12:54
GNU uniq has --check-chars, which enables you to ignore suffixes, too.
– Michael Vehrs, Commented May 6, 2016 at 12:58
@MichaelVehrs Seems it doesn't let you ignore the last N characters; it only lets you say "process only the first N characters".
– MathematicalOrchid, Commented May 6, 2016 at 13:05
@MathematicalOrchid Correct, just like --skip-chars enables you to ignore the first N characters.
– Michael Vehrs, Commented May 6, 2016 at 13:07

Faheem Mitha · Accepted Answer · 2016-05-06 16:56:07Z

5

In the simplest case, to keep the lines without :FOO, you could just remove :FOO and then pass through uniq:

$ sed 's/:FOO$//' file | uniq
red.7
green.2
blue.6
yellow.9

If you prefer to keep the :FOO lines and assuming that they always come after their non-suffixed brethren, you could try:

$ rev file | sed 's/:/ /' | uniq -f1 | sed 's/ /:/' | rev
red.7
green.2:FOO
blue.6
yellow.9:FOO

rev prints each line from right to left. The first sed replaces the first : with a space so that uniq can recognize FOO (or OOF, in this case) as the first field, which it should ignore. The next sed puts the : back, and the final rev restores the original left-to-right order.
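To see what uniq actually receives, the first two stages can be run on their own. This is only a sketch on the question's sample data; the file name sample.txt is a hypothetical stand-in for the real file.

```shell
# Hypothetical sample file matching the question's example.
printf '%s\n' red.7 green.2 green.2:FOO blue.6 yellow.9:FOO > sample.txt

# Reverse each line, then turn the first ':' into a space so that the
# reversed suffix "OOF" becomes a separate blank-delimited field.
# Lines that had no ':' still consist of a single field.
rev sample.txt | sed 's/:/ /'
# 7.der
# 2.neerg
# OOF 2.neerg
# 6.eulb
# OOF 9.wolley
```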

Unfortunately, this approach has a flaw. uniq -f1 skips the first field, where fields are delimited by spaces or tabs, and a line that contains no blank at all consists of a single field. Skipping that field leaves an empty string to compare, so all such lines are treated as identical:

$ printf 'foo/1\nfoo/2\nfoo%%3\nfoo:4\n' 
foo/1
foo/2
foo%3
foo:4
$ printf 'foo/1\nfoo/2\nfoo%%3\nfoo:4\n'  | uniq -f1
foo/1

This means the solution above collapses any run of consecutive lines that contain no :, keeping only the first of them. As an alternative, you could grep for all instances of :FOO in your file, remove the :FOO and feed the result to a new grep as a list of patterns to avoid:

$ grep -Fxvf <(sed -n 's/:FOO$//p' file) file
red.7
green.2:FOO
blue.6
yellow.9:FOO

edited May 6, 2016 at 16:56 by Faheem Mitha
answered May 6, 2016 at 12:57 by terdon ♦

I like this answer. But for some reason, it doesn't appear to work on my actual data, and I can't figure out why...
– MathematicalOrchid, Commented May 6, 2016 at 13:35
@MathematicalOrchid can you edit your question with an example of lines it fails for? There's probably something specific to your data. Alternatively, ping me (@terdon) in /dev/chat and we can debug it there.
– terdon ♦, Commented May 6, 2016 at 13:37
It goes wrong at the uniq step. Try with the following lines: /Foo, /Foo/Bar, /Foo/Bar/Baz. After uniq, only ooF/ remains.
– MathematicalOrchid, Commented May 6, 2016 at 13:46
@MathematicalOrchid please come into /dev/chat. This solution depends on having :, like in your example, what you show above is completely different. Come into chat and we can look into it.
– terdon ♦, Commented May 6, 2016 at 13:47


Gilles 'SO- stop being evil' · Accepted Answer · 2016-05-07 18:23:51Z

5

One way in awk:

awk '$0 != x ":FOO" && NR>1 {print x} {x=$0} END {print}' file

This saves each line in x, then checks on the following line whether that line equals the saved string plus :FOO; if it does not, the saved line is printed. The last line is always printed in END, because there is no following line that could be its :FOO variant.
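The same logic can be spelled out with comments. This is a sketch, not the answer's exact command: the function name drop_unsuffixed, the variable name prev and the file sample.txt are hypothetical, and the END block prints the saved line guarded by NR, an assumption added so that it does not rely on $0 surviving into END and prints nothing for empty input.

```shell
# Hypothetical sample file matching the question's example.
printf '%s\n' red.7 green.2 green.2:FOO blue.6 yellow.9:FOO > sample.txt

# Hypothetical helper wrapping the awk one-liner.
drop_unsuffixed() {
    awk '
        # From the second line on: print the previous line unless the
        # current line is exactly the previous line followed by ":FOO".
        NR > 1 && $0 != prev ":FOO" { print prev }
        # Remember the current line for the next comparison.
        { prev = $0 }
        # The last line has no successor, so it is always kept.
        # Assumption: print prev, and only if any input was read.
        END { if (NR) print prev }
    ' "$@"
}

drop_unsuffixed sample.txt
# red.7
# green.2:FOO
# blue.6
# yellow.9:FOO
```

Because only the previous line is held in memory, this runs in a single pass and depends on the input being sorted so that each :FOO line directly follows its unsuffixed twin.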

edited May 7, 2016 at 18:23 by Gilles 'SO- stop being evil'
answered May 6, 2016 at 14:18 by 123

+1. nice, but try this version instead: awk '! match($0,x"(:FOO)?$") && NR > 1 {print x} {x=$0} END {print}'. This version gets rid of ordinary (non- :FOO) duplicate lines too (e.g. two blue.6 lines).
– cas, Commented May 6, 2016 at 15:01


steeldriver · Accepted Answer · 2016-05-06 14:36:56Z

4

How about joining adjacent pairs of lines, and then using a backreference to find the non-unique prefix?

$ sed '$!N; /\(.*\)\n\1:FOO/D; P;D' file
red.7
green.2:FOO
blue.6
yellow.9:FOO

Explanation:

$!N - if we are not already at the last line, append the next line to the pattern space, separated by a newline
/\(.*\)\n - match everything up to the newline (i.e. the first of each pair of lines) and save it into a capture group
\1:FOO now matches whatever was captured from the first line, followed by :FOO (\1 is a backreference to the first capture group)
/\(.*\)\n\1:FOO/D - if the second line of each pair is the same as the first followed by :FOO, then Delete the first
Print and Delete the remaining line ready to start the next cycle

or neater (thanks @don_crissti)

$ sed '$!N; /\(.*\)\n\1:FOO/!P;D' file

N means there are always two consecutive lines in the pattern space and sed Prints the first one of them only if the second line isn't the same as the first one plus the suffix :FOO. Then D removes the first line from the pattern space and restarts the cycle.
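One caveat worth checking: the pattern is not anchored, so the captured text may start in the middle of the first line. The sketch below uses a hypothetical two-line input (tricky.txt) to compare the unanchored command with an anchored variant. The ^ and $ anchors are an addition here, not part of the commands above.

```shell
# Hypothetical input: "ab" is NOT a near-duplicate of "b:FOO".
printf '%s\n' ab b:FOO > tricky.txt

# Unanchored pattern: "b\nb:FOO" matches inside "ab\nb:FOO",
# so "ab" is dropped even though it differs from "b".
sed '$!N; /\(.*\)\n\1:FOO/!P;D' tricky.txt
# b:FOO

# Anchored variant: the whole first line must equal the second line
# minus its ":FOO" suffix, so "ab" is kept.
sed '$!N; /^\(.*\)\n\1:FOO$/!P;D' tricky.txt
# ab
# b:FOO
```

On the question's sample data both commands give the same result; the difference only shows up when one line ends with the full text of the next line's unsuffixed part.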

edited May 6, 2016 at 14:36
answered May 6, 2016 at 12:53 by steeldriver

Sorry @don_crissti maybe the coffee hasn't kicked in yet but I don't understand. The first D is deleting the (near) duplicate line, while the second D is to make sure we keep a rolling buffer of only 2 lines... or so I thought?
– steeldriver, Commented May 6, 2016 at 13:05
I have literally no idea what on Earth this does, but it works. Very slow, but it works.
– MathematicalOrchid, Commented May 6, 2016 at 14:05
@MathematicalOrchid This should be quick, how big is your file ?
– 123, Commented May 6, 2016 at 14:10
@123 The file is about 72MB.
– MathematicalOrchid, Commented May 6, 2016 at 15:50

