
This is linked to another question/code-golf I asked on Code Golf: "Color highlighting" of repeated text

I've got a file 'sample1.txt' with the following content:

LoremIpsumissimplydummytextoftheprintingandtypesettingindustry.LoremIpsumhasbeentheindustry'sstandarddummytexteversincethe1500s,whenanunknownprintertookagalleyoftypeandscrambledittomakeatypespecimenbook.

I've got a script generating the following array of strings which occur in the file (only a few shown for illustration):

LoremIpsum
LoremIpsu
dummytext
oremIpsum
LoremIps
dummytex
industry
oremIpsu
remIpsum
ummytext
LoremIp
dummyte
emIpsum
industr
mmytext

Working from the top of the list, I need to check whether 'LoremIpsum' occurs in the file sample1.txt. If so, I want to replace all occurrences of LoremIpsum with <T1>LoremIpsum</T1>. When the program moves on to the next word, 'LoremIpsu', it should NOT match against the text inside <T1>LoremIpsum</T1> in sample1.txt. The same procedure should be repeated for every element of this 'array'. The next 'valid' one would be 'dummytext', which should be tagged as <T2>dummytext</T2>.

I do think it should be possible to create a bash shell script solution for this rather than relying on perl/python/ruby programs.

  • It sounds like a job for sed, but the question is not clear to me. Commented Jul 10, 2010 at 7:18
  • Hi Marco - does the T2 example help? Commented Jul 10, 2010 at 7:20
  • Why do you want to use shell script? Why not use whichever tool is best for the job? Perl was MADE for low-programmer-time text processing. Commented Jul 10, 2010 at 7:27
  • I have a shell script running which generates the list you see above. I would love to keep using one framework rather than mixing and matching, but sure, I'll go for a Perl solution as well. The Perl program SHOULD accept the list from the script output! Commented Jul 10, 2010 at 7:48

2 Answers


Pure Bash (no externals)

At the Bash command line:

$ sample="LoremIpsumissimplydummytextoftheprintingandtypesettingindustry.LoremIpsumhasbeentheindustry'sstandarddummytexteversincethe1500s,whenanunknownprintertookagalleyoftypeandscrambledittomakeatypespecimenbook."
$ # or: sample=$(<sample1.txt)
$ array=(
LoremIpsum
LoremIpsu
dummytext
...
)
$ tag=0; for entry in "${array[@]}"; do test="<[^>/]*>[^>]*${entry}[^<]*</"; if [[ ! $sample =~ $test ]]; then ((tag++)); sample=${sample//${entry}/<T$tag>$entry</T$tag>}; fi; done; echo "Output:"; echo "$sample"
Output:
<T1>LoremIpsum</T1>issimply<T2>dummytext</T2>oftheprintingandtypesetting<T3>industry</T3>.<T1>LoremIpsum</T1>hasbeenthe<T3>industry</T3>'sstandard<T2>dummytext</T2>eversincethe1500s,whenanunknownprintertookagalleyoftypeandscrambledittomakeatypespecimenbook.
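
Since the word list comes from another script's output, the one-liner above could also be wrapped in a function that reads the list from standard input. The following is only a sketch under a few assumptions: `tag_words` is a made-up name, the list arrives one entry per line, and the entries contain no regex or glob metacharacters. Unlike the one-liner, it also skips entries that do not occur in the text at all, so no tag number is spent on them.

```bash
#!/usr/bin/env bash
# Sketch only: tag_words is a hypothetical helper, not part of any tool.
# Usage: generate-list | tag_words "$text"
tag_words() {
  local text=$1 entry inside tag=0
  # Read one candidate per line, in the order the generator emits them.
  while IFS= read -r entry; do
    [[ -n $entry ]] || continue
    # Matches when the entry sits between an opening <Tn> and the next "</".
    # Assumption: entries contain no regex metacharacters.
    inside="<[^>/]*>[^>]*${entry}[^<]*</"
    # Tag only entries that occur and are not already inside a tagged region.
    if [[ $text == *"$entry"* && ! $text =~ $inside ]]; then
      tag=$((tag + 1))
      text=${text//"$entry"/<T$tag>$entry</T$tag>}
    fi
  done
  printf '%s\n' "$text"
}

# Shortened illustrative input; for the real file use: sample=$(<sample1.txt)
sample="LoremIpsumissimplydummytext.LoremIpsumhasbeendummytext"
printf '%s\n' LoremIpsum LoremIpsu dummytext oremIpsum | tag_words "$sample"
```

In practice the `printf` in the last line would be replaced by whatever script generates the list.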

Straightforward with Perl:

#! /usr/bin/perl

use warnings;
use strict;

my @words = qw/
  LoremIpsum
  LoremIpsu
  dummytext
  oremIpsum
  LoremIps
  dummytex
  industry
  oremIpsu
  remIpsum
  ummytext
  LoremIp
  dummyte
  emIpsum
  industr
  mmytext
/;

my $to_replace = qr/@{[ join "|" =>
                        sort { length $b <=> length $a }
                        @words
                     ]}/;

my $i = 0;
while (<>) {
  s|($to_replace)|++$i; "<T$i>$1</T$i>"|eg;
  print;
}

Sample run (wrapped to prevent horizontal scrolling):

$ ./tag-words sample.txt
<T1>LoremIpsum</T1>issimply<T2>dummytext</T2>oftheprintingandtypesetting<T3>indus
try</T3>.<T4>LoremIpsum</T4>hasbeenthe<T5>industry</T5>'sstandard<T6>dummytext</T
6>eversincethe1500s,whenanunknownprintertookagalleyoftypeandscrambledittomakeatyp
especimenbook.

You may object that all the qr// and @{[ ... ]} business is on the baroque side. One could get the same effect with the /o regular-expression switch as in

# plain scalar rather than a compiled pattern
my $to_replace = join "|" =>
                 sort { length $b <=> length $a }
                 @words;

my $i = 0;
while (<>) {
  # o at the end for "compile (o)nce"
  s|($to_replace)|++$i; "<T$i>$1</T$i>"|ego;
  print;
}

1 Comment

Hi gbacon - umm - the second replacement should be "T2", the third "T3"... just FYI - I know it's a minor change for your code
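
For comparison, the single-pass alternation idea can be roughed out in bash with one tag number per distinct word instead of one per match. This is a sketch, not a translation of the Perl above: `tag_by_word` is a made-up name, it needs bash 4 or later for associative arrays, and it assumes the words contain no regex or glob metacharacters. It relies on POSIX extended regular expressions being specified as leftmost-longest, so the longest candidate at a position should win without sorting; passing the words longest first does no harm.

```bash
#!/usr/bin/env bash
# Sketch only: tag_by_word is a hypothetical helper (bash 4+ required).
# Usage: tag_by_word "$text" word1 word2 ...
tag_by_word() {
  local rest=$1 out="" alt="" word match next=0
  local -A tag_of=()
  shift
  # Join the non-empty words into one alternation: w1|w2|...
  for word in "$@"; do
    [[ -n $word ]] || continue
    alt+="${alt:+|}$word"
  done
  # No usable words: print the text unchanged.
  [[ -n $alt ]] || { printf '%s\n' "$rest"; return; }
  # Repeatedly take the leftmost match and move everything up to it into out.
  while [[ $rest =~ $alt ]]; do
    match=${BASH_REMATCH[0]}
    # Assign a number the first time a distinct word is seen.
    if [[ -z ${tag_of[$match]:-} ]]; then
      tag_of[$match]=$((++next))
    fi
    out+="${rest%%"$match"*}<T${tag_of[$match]}>$match</T${tag_of[$match]}>"
    rest=${rest#*"$match"}
  done
  printf '%s\n' "$out$rest"
}

# Shortened illustrative input; for the real file use: sample=$(<sample1.txt)
sample="LoremIpsumissimplydummytext.LoremIpsumhasbeendummytext"
tag_by_word "$sample" LoremIpsum LoremIpsu dummytext
```

Note that the numbers here follow the order in which words first appear in the text, not their order in the list; for the sample in the question the two happen to coincide.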
