I'm trying to wrap certain substrings in a text file (corpus.txt) in marker tags using sed. The list of substrings is in a file, sub.txt, which contains the following:

dogs chase
birds eat
chase birds
chase cat
chase birds .

and a corpus.txt containing some texts as below:

dogs chase cats around
dogs bark
cats meow
dogs chase birds
cats chase birds , birds eat grains
dogs chase the cats
the birds chirp

with the desired output

<bop> dogs chase <eop> cats around
dogs bark
cats meow
<bop> dogs chase <eop> birds 
cats <bop> chase birds <eop> , <bop> birds eat <eop> grains
<bop> dogs chase <eop> the cats
the birds chirp

Running the command sed -f <(sed 's/.*/s|\\b&\\b|<bop> & <eop>|g/' sub.txt) corpus.txt produces the desired output for every line except the fifth, where it returns:

cats <bop> <bop> chase birds . <eop>eop> , <bop> birds eat <eop> grains

What can I do to get this to work?

  • You're asking for it: your first file has chase birds twice. Perhaps pipe it through uniq to eliminate duplicates. Commented Jun 25, 2020 at 19:42
  • Hi karakfa, they aren't duplicates. I have chase birds and chase birds . Commented Jun 25, 2020 at 19:44
  • OK, do you agree that if chase birds . matches, chase birds matches as well? Also, the . in the first one will match any character, because . is a special character in a regular expression. So both matches take place. If you want a literal match, escape the . with \ in your sub.txt file. Commented Jun 25, 2020 at 19:49
  • I don't think I would agree that if chase birds . matches, chase birds matches as well. In the worst case I expect it to match just chase birds . Commented Jun 25, 2020 at 19:52
  • The second one is a subset of the first one, so it has to match by definition. However, the real problem here is that if you want a literal . match, you have to escape it. Commented Jun 25, 2020 at 19:53

1 Answer

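You have to escape the . in the first file so that it matches literally. Unescaped, . matches any character, so the chase birds . rule also matches the text chase birds < that the earlier chase birds rule produced when it inserted <eop>.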

$ sed -f <(sed 's/\./\\./;s/.*/s|\\b&\\b|<bop> & <eop>|g/' sub_o.txt) file

<bop> dogs chase <eop> cats around
dogs bark
cats meow
<bop> dogs chase <eop> birds
cats <bop> chase birds <eop> , <bop> birds eat <eop> grains
<bop> dogs chase <eop> the cats
the birds chirp
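
As a side note, the command above escapes only the first . on each line of the pattern file and leaves other regex metacharacters alone. A minimal sketch of a more general version follows, assuming GNU sed (for \b) and using the sample files from the question. The temporary directory and the rules.sed file name are illustrative, not part of the original command; writing the rules to a file also makes it easy to inspect what sed is actually asked to do.

```shell
#!/bin/sh
# Sketch only: assumes GNU sed (\b is a GNU extension). The temp directory
# and the rules.sed name are illustrative, not from the original post.
workdir=$(mktemp -d)

cat > "$workdir/sub.txt" <<'EOF'
dogs chase
birds eat
chase birds
chase cat
chase birds .
EOF

cat > "$workdir/corpus.txt" <<'EOF'
dogs chase cats around
dogs bark
cats meow
dogs chase birds
cats chase birds , birds eat grains
dogs chase the cats
the birds chirp
EOF

# First expression: put a backslash before ] [ \ . * ^ $ and the | delimiter.
# Second expression: wrap each escaped phrase in an s|\b...\b|...|g rule.
sed -e 's/[][\\.*^$|]/\\&/g' \
    -e 's/.*/s|\\b&\\b|<bop> & <eop>|g/' \
    "$workdir/sub.txt" > "$workdir/rules.sed"

# Show the generated rules, then apply them to the corpus.
cat "$workdir/rules.sed"
result=$(sed -f "$workdir/rules.sed" "$workdir/corpus.txt")
printf '%s\n' "$result"
```

Two caveats with this sketch. A phrase containing & is not handled, because & is special in the replacement part of the rule. Also, once the . is literal, the rule for chase birds . ends in \.\b, which only matches when the . is immediately followed by a word character, since \b needs a word character on one side.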

7 Comments

That works fine, @karakfa. I added dogs chase-@@ boy to the corpus and changed the sed command to sed -f <(sed 's/\./\\./;s/\-/\\-/;s/.*/s|\\b&\\b|<bop> & <eop>|g/' sub.txt) corpus.txt to try to escape the -, but it doesn't seem to work.
A hyphen is not special unless it is inside square brackets, so there is no need to escape it.
Yes, but I get <bop> dogs chase <eop>@@- boy instead of dogs chase@@- boy left as it is.
As it should. @ is not a word character, so there is a word boundary between chase and @ (just as there would be before white space), and the phrase matches as you specified.
Oh, I see. Is there a way to make sure the sed command leaves dogs chase@@- boy unchanged?
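
One possible approach to that last question, offered as a sketch and not as something confirmed in this thread: instead of \b, require white space or a line edge on both sides of the phrase and put the captured neighbours back in the replacement. The variable names below are illustrative. The first command assumes GNU sed for \b; the others use sed -E (extended regular expressions).

```shell
#!/bin/sh
# Sketch only; variable names are illustrative.
line='dogs chase@@- boy'

# With \b (GNU sed), the boundary between "e" and "@" lets the phrase match.
with_b=$(printf '%s\n' "$line" | sed 's|\bdogs chase\b|<bop> & <eop>|g')

# Alternative: demand white space or start/end of line around the phrase,
# and restore whatever was captured on each side via \1 and \2.
ws_rule='s/(^|[[:space:]])dogs chase([[:space:]]|$)/\1<bop> dogs chase <eop>\2/g'
with_ws=$(printf '%s\n' "$line" | sed -E "$ws_rule")

# The same rule still tags the phrase when it is followed by a space.
plain=$(printf '%s\n' 'dogs chase cats' | sed -E "$ws_rule")

printf '%s\n' "$with_b" "$with_ws" "$plain"
```

Because the neighbouring white space is consumed by the match, two occurrences of the same phrase separated by a single space would have only the first one tagged in a single pass.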