I'm trying to replace substrings in a text file with other substrings using sed. For example:

sed 's/dogs chase/<bop> dogs chase <eop>/g; s/birds eat/<bop> birds eat <eop>/g' corpus.txt

This replaces dogs chase in corpus.txt with <bop> dogs chase <eop>, and birds eat with <bop> birds eat <eop>.

Suppose all the substrings are listed in a text file, sub.txt, and I want to use them to replace the text in corpus.txt. Is there a way to make the command read them from that file? For example, given this sub.txt and the equivalent hand-written command:

dogs chase
birds eat
chase birds
chase cat

sed 's/dogs chase/<bop> dogs chase <eop>/g; s/chase birds/<bop> chase birds <eop>/g; s/chase cat/<bop> chase cat <eop>/g; s/birds eat/<bop> birds eat <eop>/g' corpus.txt

The sed command replaces dogs chase with <bop> dogs chase <eop>, birds eat with <bop> birds eat <eop>, chase birds with <bop> chase birds <eop>, and chase cat with <bop> chase cat <eop>. Writing this command by hand would be difficult if sub.txt contained hundreds of substrings.

The corpus.txt file:

dogs chase cats around
dogs bark
cats meow
dogs chase birds
cats chase birds , birds eat grains
dogs chase the cats
the birds chirp

The desired output:

<bop> dogs chase <eop> cats around
dogs bark
cats meow
<bop> dogs chase <eop> birds
cats <bop> chase birds <eop> , <bop> birds eat <eop> grains
<bop> dogs chase <eop> the cats
the birds chirp
1 Answer

With GNU sed and bash:

sed -f <(sed 's/.*/s|&|<bop> & <eop>|g/' sub.txt) corpus.txt

Output:

<bop> dogs chase <eop> cats around
dogs bark
cats meow
<bop> dogs chase <eop> birds
cats <bop> chase birds <eop> , <bop> birds eat <eop> grains
<bop> dogs chase <eop> the cats
the birds chirp
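
To see why this works, it can help to look at the script the inner sed generates before it is passed to -f. The following is a minimal sketch, not part of the original answer: it recreates the sub.txt from the question in a temporary directory (the tmpdir and generated variable names are only illustrative) and prints one s|...|...|g command per line of sub.txt.

```shell
# Recreate sub.txt from the question in a temporary directory (illustrative setup)
tmpdir=$(mktemp -d)
printf '%s\n' 'dogs chase' 'birds eat' 'chase birds' 'chase cat' > "$tmpdir/sub.txt"

# The inner sed turns every line into a substitution command;
# & in the replacement stands for the whole matched line
generated=$(sed 's/.*/s|&|<bop> & <eop>|g/' "$tmpdir/sub.txt")

# Print the generated sed script, one command per phrase
printf '%s\n' "$generated"

rm -r "$tmpdir"
```

The first generated line is s|dogs chase|<bop> dogs chase <eop>|g, which matches the first hand-written command in the question. Because each line of sub.txt is used as a regular expression, the commands run in file order and any regex metacharacters in sub.txt keep their special meaning.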

4 Comments

If I change the order of sub.txt so that chase cat comes first, the output changes: the first line becomes dogs <bop> chase cat <eop>s around. Can I do something to prevent this? What I am trying to do is a kind of bigram matching.
This might help. \b marks a word boundary: sed -f <(sed 's/.*/s|\\b&\\b|<bop> & <eop>|g/' sub.txt) corpus.txt
Hi @Cyrus, sorry to bother you again. I added chase birds . to sub.txt, and the 5th line of corpus.txt becomes cats <bop> <bop> chase birds . <eop>eop> , <bop> birds eat <eop> grains, which is not supposed to happen. What can I do?
I suggest starting a new question with these requirements.
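
The doubled <bop> in the last report most likely comes from the . in chase birds . being read as "any character", so the pattern matches chase birds < in the already-tagged text. One possible direction, offered as a sketch rather than a tested solution for every input, is to backslash-escape regex metacharacters in sub.txt before building the commands. The file names below are temporary stand-ins, and the sketch writes the generated script to a file instead of using bash process substitution. It assumes sub.txt contains no & characters, which would still be special in the replacement.

```shell
# Work in a temporary directory so no real files are touched (illustrative setup)
tmpdir=$(mktemp -d)
printf '%s\n' 'dogs chase' 'birds eat' 'chase birds' 'chase birds .' > "$tmpdir/sub.txt"
printf '%s\n' 'cats chase birds , birds eat grains' > "$tmpdir/corpus.txt"

# Step 1: backslash-escape ] [ \ . * ^ $ and the | delimiter in every phrase
# Step 2: wrap each escaped phrase in an s|...|...|g command
sed -e 's/[][\.*^$|]/\\&/g' \
    -e 's/.*/s|&|<bop> & <eop>|g/' "$tmpdir/sub.txt" > "$tmpdir/script.sed"

# Apply the generated script to the corpus
result=$(sed -f "$tmpdir/script.sed" "$tmpdir/corpus.txt")
printf '%s\n' "$result"

rm -r "$tmpdir"
```

With the dot escaped, chase birds . only matches a literal period, so the line comes out as cats <bop> chase birds <eop> , <bop> birds eat <eop> grains. This does not address overlapping phrases or the ordering issue raised earlier in the comments.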
