
I'm reading a file in chunks using binmode() and want to strip out byte values that match any value in a static list:

@strip = (91,   92,   98,   107,   5,   64,   21,   13,   11,   12)

This is what I'm doing in my script:

binmode($fh);
read($fh,$data,20);
%strip = (91=>1, 92=>1, 98=>1, 107=>1, 5=>1, 64=>1, 21=>1, 13=>1, 11=>1, 12=>1);
$data =~ s/(.)/$strip{ord($1)} ? "" : $1/ge;

I'm afraid that doing it the regex way might be incorrect and have some undesirable results.

Can someone suggest alternative ways that are cleaner and more efficient?

1 Answer


The regex engine is perfectly happy to operate on strings of bytes (though using \d and such may not make any sense), so your approach is perfectly fine. But while quite efficient, it can be sped up.

What if we used chr on the bytes to strip rather than using ord on all the characters read?

my @to_strip = ( 5, 11, 12, 13, 21, 64, 91, 92, 98, 107 );
my %to_strip = map { chr($_) => 1 } @to_strip;

$data =~ s/(.)/ $to_strip{$1} ? "" : $1 /sge;

What if we took it a step further, and made the replacement choice even sooner?

my @to_strip = ( 5, 11, 12, 13, 21, 64, 91, 92, 98, 107 );
my %to_strip = map { chr($_) => 1 } @to_strip;
my %map = map { $_ => ( $to_strip{$_} ? "" : $_ ) } map chr, 0x00..0xFF;

$data =~ s/(.)/$map{$1}/sg;

But we're still doing a lot of needless replacements. What if we search for the specific character we want to replace?

my @to_strip = ( 5, 11, 12, 13, 21, 64, 91, 92, 98, 107 );
my $pat = "[" . quotemeta( pack( 'C*', @to_strip ) ) . "]+";
my $re = qr/$pat/;

$data =~ s/$re//g;

This one is much faster for three reasons:

  • As previously mentioned, we greatly reduced the number of matches, which reduces the number of times the replacement expression needs to be evaluated and concatenated.
  • The regex engine can check for matching characters far faster than our Perl code can.
  • We eliminated the need for captures, which are (relatively speaking) quite slow.
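If you want to check these claims on your own data, here is a rough sketch using the core Benchmark module. The sample data (random bytes) and the sub names strip_per_char and strip_char_class are made up for illustration; real timings will depend on how often the stripped bytes occur in your input.

```perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

my @to_strip = ( 5, 11, 12, 13, 21, 64, 91, 92, 98, 107 );

# Everything below is built once, outside the timed code.
my %to_strip = map { chr($_) => 1 } @to_strip;
my $pat      = "[" . quotemeta( pack( 'C*', @to_strip ) ) . "]+";
my $re       = qr/$pat/;

# Hypothetical sample: 64 KiB of random bytes.
my $sample = join '', map { chr( int rand 256 ) } 1 .. 65536;

# Each sub works on a copy so that every iteration sees the same input.
sub strip_per_char {
    my ($d) = @_;
    $d =~ s/(.)/ $to_strip{$1} ? "" : $1 /sge;
    return $d;
}

sub strip_char_class {
    my ($d) = @_;
    $d =~ s/$re//g;
    return $d;
}

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese( -1, {
    per_char   => sub { strip_per_char($sample) },
    char_class => sub { strip_char_class($sample) },
} );
```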

Remember that @to_strip, %to_strip, %map, $pat and $re only need to be calculated once, not once per read. When I talked about speed above, I wasn't including the time needed to calculate these, since I assumed you will be doing multiple reads and replaces.
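For example, a read loop might look something like the following sketch. The in-memory file and its contents are stand-ins so the snippet runs on its own; in your script, $fh would be the handle you already have open.

```perl
use strict;
use warnings;

# Built once, before the loop.
my @to_strip = ( 5, 11, 12, 13, 21, 64, 91, 92, 98, 107 );
my $pat = "[" . quotemeta( pack( 'C*', @to_strip ) ) . "]+";
my $re  = qr/$pat/;

# Hypothetical input: an in-memory "file" standing in for the real one.
my $input = "he\x05llo\x0D\x0A[wor\\ld]\x15" x 3;
open( my $fh, '<', \$input ) or die "Can't open in-memory file: $!";
binmode($fh);

my $output = '';
while ( read( $fh, my $data, 20 ) ) {
    $data =~ s/$re//g;    # only the substitution runs once per read
    $output .= $data;
}
close($fh);
```

Because the stripping works byte by byte, it doesn't matter where the chunk boundaries fall.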


That said, if it's reasonable to hardcode the bytes to remove, tr///d will give you the best performance.

$data =~ tr/\x05\x0B-\x0D\x15\x40\x5B\x5C\x62\x6B//d;

It's less efficient to use tr/// with a dynamic list because tr/// doesn't interpolate. We have to resort to building a sub with eval, and invoking a sub is relatively slow.

my @to_strip = ( 5, 11, 12, 13, 21, 64, 91, 92, 98, 107 );
my $class = quotemeta( pack( 'C*', @to_strip ) );
my $inline_stripper = eval("sub { \$_[0] =~ tr/$class//d; }");

$inline_stripper->($data);
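Since the sub is compiled from a string, it's probably worth checking that the eval succeeded. Here is a minimal sketch along those lines; the sample $data is made up, and note that $_[0] has to be escaped inside the double-quoted string so that it reaches the generated code instead of being interpolated.

```perl
use strict;
use warnings;

my @to_strip = ( 5, 11, 12, 13, 21, 64, 91, 92, 98, 107 );
my $class = quotemeta( pack( 'C*', @to_strip ) );

# \$_[0] is kept literal for the generated code; $class is interpolated now.
my $inline_stripper = eval "sub { \$_[0] =~ tr/$class//d }"
    or die "Could not compile stripper: $@";

# Hypothetical sample data.
my $data = "he\x05llo\x0D\x0A[wor\\ld]\x15";

# The sub modifies its argument in place (via the @_ alias)
# and returns tr///'s count of deleted bytes.
my $removed = $inline_stripper->($data);
```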

The following is an efficient (but surely not as efficient) non-regex approach.

my @to_strip = ( 5, 11, 12, 13, 21, 64, 91, 92, 98, 107 );
my @to_strip_lookup; $to_strip_lookup[$_] = 1 for @to_strip;

$data = pack 'C*', grep !$to_strip_lookup[$_], unpack 'C*', $data;

9 Comments

Thank you for the answer. Will not using chr() on a binmode string change characters?
Re "will not using chr() on a binmode string change characters", huh? None of the solutions use chr on a string.
Just curious: this is something like an A minus B set operation, so can it be done with two arrays of byte values?
You could consider the list of bytes to strip a set, but not the data, because the values of sets are unique and unordered. But you aren't wrong. The non-regex solution I've just added is basically exactly how I'd do a set difference (if the elements of the sets started in arrays): my %B = map { $_ => 1 } @B; my @AminusB = grep !$B{$_}, @A;
list2re from Data::Munge can nicely wrap the logic of creating the regex (though I don't know if the resulting regex will be as efficient as a character class in this case).