I searched for similar problems but could not find anything that fits my needs exactly.

I have a very large HTML file scraped from multiple websites. For every key listed in a second file (keys), I would like to replace all occurrences of

class="<key from 2nd file>"

with

style="xxxx"

At the moment I use sed in a loop. It works well, but only with small files:

while read key; do sed -i "s/class=\"$key\"/style=\"xxxx\"/g" file_to_process; done < keys

When I try to process anything larger, it takes ages.

Example:

keys - Count: 1233 lines
file_to_process - Count: 1946 lines

It takes about 40 s to complete only 1/10 of the processing I need:

real    0m40.901s
user    0m8.181s
sys     0m15.253s
  • It would help to include sample data from keys in your message. Good luck. Commented Oct 30, 2012 at 15:06
  • '40 s to complete only 1/10'... So now that it's more than 400s later, your job is done, right? Commented Oct 30, 2012 at 15:10

5 Answers

Untested since you didn't provide any sample input and expected output:

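awk '
# First file (keys): join all keys into one alternation, key1|key2|...
NR==FNR { keys = keys sep $0; sep = "|"; next }
# Second file: replace every class="<key>" in a single pass
{ gsub("class=\"(" keys ")\"","style=\"xxxx\"") }
1' keys file_to_process > tmp$$ &&
mv tmp$$ file_to_process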

1 Comment

Wow, thanks so much, this is super fast: real 0m0.273s, user 0m0.264s, sys 0m0.008s

I think it's time for Perl (untested):

use strict;
use warnings;

my $keyfilename = 'somekeyfile'; # or pick up from script arguments
open my $keyfh, '<', $keyfilename or die "Could not open key file $keyfilename: $!\n";
chomp(my @keylist = <$keyfh>); # strip newlines so keys match attribute values
my %keys = map { $_ => 1 } @keylist; # construct a hash for lookup speed
close $keyfh;

my $htmlfilename = 'somehtmlfile'; # or pick up from script arguments
open my $htmlfh, '<', $htmlfilename or die "Could not open html file $htmlfilename: $!\n";
my $newchunk = qq/style="xxxx"/;
while (my $line = <$htmlfh>) {
    # replace a class attribute only when its value is one of the keys
    $line =~ s/(class="([^"]+)")/exists $keys{$2} ? $newchunk : $1/ge;
    print $line;
}
close $htmlfh;

This uses a hash for key lookups, which should be reasonably fast, and it performs a lookup only for the value of each class attribute found on a line.


Try generating one long sed script that contains a substitution command for every line of the keys file, something like:

s/class=\"key1\"/style=\"xxxx\"/g; s/class=\"key2\"/style=\"xxxx\"/g ...

and run sed with this script file (sed -f). This way the input file is read only once.
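
A minimal sketch of generating and applying such a script, assuming keys holds one key per line and no key contains characters that are special to sed (such as /, . or [). The sample data, the scratch directory and the script name subs.sed are made up for illustration.

```shell
# Scratch directory with tiny, made-up sample data.
workdir=$(mktemp -d)
printf '%s\n' key1 key2 > "$workdir/keys"
printf '%s\n' '<p class="key1">a</p>' '<p class="other">b</p>' '<p class="key2">c</p>' \
    > "$workdir/file_to_process"

# Turn each key into one substitution command; & stands for the matched key.
# subs.sed is a hypothetical file name.
sed 's|.*|s/class="&"/style="xxxx"/g|' "$workdir/keys" > "$workdir/subs.sed"

# Apply every substitution in a single pass over the input.
result=$(sed -f "$workdir/subs.sed" "$workdir/file_to_process")
printf '%s\n' "$result"
```

Each input line is still tested against every command in the script, but sed is started only once and the file is not rewritten once per key.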


Here's one way using GNU awk:

awk 'FNR==NR { array[$0]++; next } { for (i in array) { a = "class=\"" i "\""; gsub(a, "style=\"xxxx\"") } }1' keys.txt file.txt

Note that the keys in keys.txt are taken as the whole line, including whitespace. If leading and trailing whitespace could be a problem, use $1 instead of $0. Unfortunately I cannot test this properly without some sample data. HTH.

1 Comment

Thanks Steve, that works like a charm, but it seems slower than my sed: real 1m19.551s, user 1m18.853s, sys 0m0.020s
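
The slowdown reported here is probably due to running one gsub per key on every line. A possible variation, sketched below under the assumption that keys are plain attribute values with no surrounding whitespace, finds each class attribute on a line and looks its value up in the array instead. The sample data and scratch directory are made up for illustration.

```shell
# Scratch directory with tiny, made-up sample data.
workdir=$(mktemp -d)
printf '%s\n' key1 key2 > "$workdir/keys.txt"
printf '%s\n' '<p class="key1">a</p><p class="other">b</p>' '<p class="key2">c</p>' \
    > "$workdir/file.txt"

result=$(awk '
FNR==NR { keys[$0]; next }                    # first file: remember each key
{
    out = ""
    rest = $0
    # walk through every class="..." attribute on the line
    while (match(rest, /class="[^"]*"/)) {
        attr = substr(rest, RSTART, RLENGTH)
        name = substr(attr, 8, RLENGTH - 8)   # text between the quotes
        if (name in keys) attr = "style=\"xxxx\""
        out = out substr(rest, 1, RSTART - 1) attr
        rest = substr(rest, RSTART + RLENGTH)
    }
    print out rest
}' "$workdir/keys.txt" "$workdir/file.txt")
printf '%s\n' "$result"
```

With this layout the work per line depends on the number of class attributes it contains rather than on the number of keys.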

First convert your keys file into an alternation pattern of the form key1|key2|key3|.... This can be done with the tr command. Once you have this pattern, you can use it in a single sed command.

Try the following:

sed -i -r  "s/class=\"($(tr '\n' '|' < keys | sed 's/|$//'))\"/style=\"xxxx\"/g" file

2 Comments

The tr will append a | to the end of the text from keys so instead of "key1|key2" you'll end up with "key1|key2|". Not sure what sed will make of that but it's probably not what you want.
I just tested and sed will treat that as if the start of a line matches the set of keys so you'll end up with every line starting with style="xxxx".
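
One way to sidestep the trailing | discussed in these comments is to join the keys with paste instead of tr, since paste -s puts the delimiter only between lines. This is a sketch on made-up sample data, assuming the keys contain no regex metacharacters. It uses sed -E for extended regular expressions where the answer uses GNU sed's -r.

```shell
# Scratch directory with tiny, made-up sample data.
workdir=$(mktemp -d)
printf '%s\n' key1 key2 key3 > "$workdir/keys"
printf '%s\n' '<p class="key2">b</p>' '<p class="other">c</p>' > "$workdir/file"

# -s joins all lines of the file into one; -d '|' sets the joining character.
pattern=$(paste -sd '|' "$workdir/keys")
printf '%s\n' "$pattern"

# One sed invocation with the whole alternation.
result=$(sed -E "s/class=\"($pattern)\"/style=\"xxxx\"/g" "$workdir/file")
printf '%s\n' "$result"
```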
