Efficient way to replace strings in one file with strings from another file

Question

I searched for similar problems but could not find anything that suits my needs exactly.

I have a very large HTML file scraped from multiple websites, and I would like to replace every

class="KEY"

(where KEY is any line from a second file, keys) with

style="xxxx"

At the moment I use sed. It works well, but only with small files:

while read key; do sed -i "s/class=\"$key\"/style=\"xxxx\"/g" file_to_process; done < keys

When I try to process something larger, it takes ages.

Example:

keys - Count: 1233 lines
file_to_process - Count: 1946 lines

It takes about 40 s to complete only 1/10 of the processing I need:

real    0m40.901s
user    0m8.181s
sys     0m15.253s

It would help to include sample data from keys in your message. Good luck.
– shellter, Commented Oct 30, 2012 at 15:06
'40 s to complete only 1/10'... So now that it's more than 400s later, your job is done, right?
– Phil H, Commented Oct 30, 2012 at 15:10

Ed Morton · Accepted Answer · 2012-10-30 15:11:10Z

2

Untested since you didn't provide any sample input and expected output:

awk '
NR==FNR { keys = keys sep $0; sep = "|"; next }
{ gsub("class=\"(" keys ")\"","style=\"xxxx\"") }
1' keys file_to_process > tmp$$ &&
mv tmp$$ file_to_process


1 Comment

hexdump Over a year ago

Wow, thanks so much, this is super fast: real 0m0.273s, user 0m0.264s, sys 0m0.008s

Phil H · Answer · 2012-10-30 15:25:43Z

I think it's time to Perl (untested):

use strict;
use warnings;

my $keyfilename = 'somekeyfile';    # or pick up from script arguments
open my $keyfh, '<', $keyfilename or die "Could not open key file $keyfilename: $!\n";
chomp(my @keylist = <$keyfh>);      # strip newlines so keys match the class values
my %keys = map { $_ => 1 } @keylist;    # construct a hash for lookup speed
close $keyfh;

my $htmlfilename = 'somehtmlfile';  # or pick up from script arguments
open my $htmlfh, '<', $htmlfilename or die "Could not open html file $htmlfilename: $!\n";
my $newchunk = 'style="xxxx"';
while (my $line = <$htmlfh>) {
    # replace each class="..." whose value is a known key; keep all others as they are
    $line =~ s/(class="([^"]+)")/exists $keys{$2} ? $newchunk : $1/ge;
    print $line;
}
close $htmlfh;

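This uses a hash for key lookups, which should be reasonably fast: each class attribute found on a line costs one hash lookup, no matter how many keys there are. Lines without a class attribute are printed unchanged. The result goes to standard output, so redirect it to a new file.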
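For comparison with the awk answers, the same hash-lookup idea can be sketched in awk. This is not Phil H's code, only an illustration of the technique: the sample keys, the sample HTML line, and the names workdir, out, rest, attr, and name are all made up for the example.

```shell
#!/bin/sh
# Hypothetical sample data in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' key1 key2 > "$workdir/keys"
printf '%s\n' '<p class="key1" id="a"><b class="other">x</b><i class="key2">y</i></p>' > "$workdir/page.html"

result=$(awk '
NR == FNR { keys[$0]; next }    # first file: remember each key as an array index
{
    out = ""; rest = $0
    # walk through every class="..." attribute on the line
    while (match(rest, /class="[^"]+"/)) {
        attr = substr(rest, RSTART, RLENGTH)
        name = substr(attr, 8, RLENGTH - 8)    # the text between the quotes
        out = out substr(rest, 1, RSTART - 1) ((name in keys) ? "style=\"xxxx\"" : attr)
        rest = substr(rest, RSTART + RLENGTH)
    }
    print out rest
}' "$workdir/keys" "$workdir/page.html")

printf '%s\n' "$result"
rm -rf "$workdir"
```

Because the keys are compared as plain strings, characters such as . or * in a key need no escaping here, unlike in the approaches that build one large regular expression.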

Eran Ben-Natan · Answer · 2012-10-30 15:05:45Z

0

Try generating one long sed script that contains a substitution command for every line of the keys file, something like:

s/class=\"key1\"/style=\"xxxx\"/g; s/class=\"key2\"/style=\"xxxx\"/g ...

and run sed with that script file (sed -f). This way the input file is read only once.
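A minimal sketch of that idea, assuming the keys contain no characters that are special to sed (such as /, ., *, [, \ or &). The sample data and the file names page.html and replace.sed are invented for the example.

```shell
#!/bin/sh
# Hypothetical sample data in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' key1 key2 > "$workdir/keys"
printf '%s\n' '<p class="key1">a</p>' '<p class="other">b</p>' '<p class="key2">c</p>' > "$workdir/page.html"

# Turn every key into one substitution command, one command per line.
# In the replacement, & stands for the matched key.
sed 's|.*|s/class="&"/style="xxxx"/g|' "$workdir/keys" > "$workdir/replace.sed"

# Apply all substitutions in a single sed run over the input.
result=$(sed -f "$workdir/replace.sed" "$workdir/page.html")

printf '%s\n' "$result"
rm -rf "$workdir"
```

sed still tries every substitution on every line, but it starts only once and rewrites the file only once, instead of once per key as in the loop from the question.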


Steve · Answer · 2012-10-30 15:05:48Z

0

Here's one way using GNU awk:

awk 'FNR==NR { array[$0]++; next } { for (i in array) { a = "class=\"" i "\""; gsub(a, "style=\"xxxx\"") } }1' keys.txt file.txt

Note that each key in keys.txt is taken as the whole line, including whitespace. If leading or trailing whitespace could be a problem, use $1 instead of $0. Unfortunately I cannot test this properly without some sample data. HTH.
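The difference between $0 and $1 can be seen on a single made-up key line with stray spaces around it (the variable names line, whole, and first are only for this illustration):

```shell
#!/bin/sh
# A hypothetical key line with one space before and after the key.
line=' key1 '

# $0 is the entire line; $1 is the first whitespace-separated field.
whole=$(printf '%s\n' "$line" | awk '{ print "[" $0 "]" }')
first=$(printf '%s\n' "$line" | awk '{ print "[" $1 "]" }')

printf '%s\n%s\n' "$whole" "$first"
```

With $0 the spaces become part of the key, so class="key1" would never match; with $1 they are dropped, at the cost of cutting off any key that itself contains a space.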


1 Comment

hexdump Over a year ago

Thanks Steve, that works like a charm but seems slower than my sed: real 1m19.551s, user 1m18.853s, sys 0m0.020s

dogbane · Answer · 2012-10-30 15:52:06Z

0

First convert your keys file into a sed or-pattern which looks like this: key1|key2|key3|.... This can be done using the tr command. Once you have this pattern, you can use it in a single sed command.

Try the following:

sed -i -r  "s/class=\"($(tr '\n' '|' < keys | sed 's/|$//'))\"/style=\"xxxx\"/g" file

edited Oct 30, 2012 at 15:52

answered Oct 30, 2012 at 15:11


2 Comments

Ed Morton Over a year ago

The tr will append a | to the end of the text from keys so instead of "key1|key2" you'll end up with "key1|key2|". Not sure what sed will make of that but it's probably not what you want.

Ed Morton Over a year ago

I just tested and sed will treat that as if the start of a line matches the set of keys so you'll end up with every line starting with style="xxxx".
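One way to sidestep the trailing separator altogether is paste -s, which puts the delimiter only between lines. The sketch below compares the two on made-up keys and then uses the paste result in a single sed call. It assumes a sed that accepts -E for extended regular expressions (the answer above uses the GNU spelling -r) and keys without regex metacharacters.

```shell
#!/bin/sh
# Hypothetical sample data in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' key1 key2 key3 > "$workdir/keys"
printf '%s\n' '<p class="key2">x</p>' > "$workdir/page.html"

# tr turns every newline, including the final one, into a separator.
with_tr=$(tr '\n' '|' < "$workdir/keys")
# paste -s joins the lines and places the delimiter only between them.
with_paste=$(paste -s -d '|' "$workdir/keys")

printf 'tr:    %s\n' "$with_tr"
printf 'paste: %s\n' "$with_paste"

# Use the clean alternation in one substitution.
result=$(sed -E "s/class=\"($with_paste)\"/style=\"xxxx\"/g" "$workdir/page.html")
printf '%s\n' "$result"
rm -rf "$workdir"
```

The tr version ends in an empty alternative, which is best not relied on; the paste version needs no clean-up step afterwards.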
