perl - remove bytes from a binmode string by looking up an array

Question

I'm reading a file in chunks using binmode() and want to strip out byte values that match any value in a static list

@strip = (91,   92,   98,   107,   5,   64,   21,   13,   11,   12)

what I'm doing in my script

binmode($fh);
read($fh,$data,20);
%strip = (91=>1, 92=>1, 98=>1, 107=>1, 5=>1, 64=>1, 21=>1, 13=>1, 11=>1, 12=>1);
$data=~s/(.)/$strip{ord($1)} ? "" :$1/ge;

I'm afraid that doing it the regex way might be incorrect and have some undesirable results.

Can someone suggest a cleaner and more efficient way to achieve this?

ikegami · Accepted Answer · 2019-03-27 01:48:01Z

The regex engine is perfectly happy to operate on strings of bytes (though using \d and such may not make any sense), so your approach is perfectly fine. But while quite efficient, it can be sped up.

What if we used chr on the bytes to strip rather than using ord on all the characters read?

my @to_strip = ( 5, 11, 12, 13, 21, 64, 91, 92, 98, 107 );
my %to_strip = map { chr($_) => 1 } @to_strip;

$data =~ s/(.)/ $to_strip{$1} ? "" : $1 /seg;

What if we took it a step further, and made the replacement choice even sooner?

my @to_strip = ( 5, 11, 12, 13, 21, 64, 91, 92, 98, 107 );
my %to_strip = map { chr($_) => 1 } @to_strip;
my %map = map { $_ => ( $to_strip{$_} ? "" : $_ ) } map chr, 0x00..0xFF;

$data =~ s/(.)/$map{$1}/sg;

But we're still doing a lot of needless replacements. What if we search for the specific character we want to replace?

my @to_strip = ( 5, 11, 12, 13, 21, 64, 91, 92, 98, 107 );
my $pat = "[" . quotemeta( pack( 'C*', @to_strip ) ) . "]+";
my $re = qr/$pat/;

$data =~ s/$re//g;

This one is much faster for three reasons:

As previously mentioned, we greatly reduced the number of matches, which reduces the number of times the replacement expression needs to be evaluated and concatenated.
The regex engine can check for matching characters far faster than our Perl code can.
We eliminated the need for captures, which are (relatively speaking) quite slow.

Remember that @to_strip, %to_strip, %map, $pat and $re only need to be calculated once, not once per read. When I talked about speed above, I wasn't including the time needed to calculate these, since I assumed you will be doing multiple reads and replaces.
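
To illustrate the "calculate once, use per read" point, here is a minimal sketch of a chunked read loop. The in-memory file and its sample bytes are assumptions made up for the demo; in a real script you would open your actual file instead.

```perl
use strict;
use warnings;

my @to_strip = ( 5, 11, 12, 13, 21, 64, 91, 92, 98, 107 );

# Built once, before the read loop.
my $pat = "[" . quotemeta( pack( 'C*', @to_strip ) ) . "]+";
my $re  = qr/$pat/;

# Assumption: an in-memory file stands in for the real input.
# Each repetition is "a", byte 5, "b", "[", newline, "c", "@", "d".
my $input = "a\x05b[\nc\@d" x 3;
open( my $fh, '<', \$input ) or die "Can't open in-memory file: $!";
binmode($fh);

my $output = '';
while (1) {
    my $n = read( $fh, my $data, 20 );
    die "Read error: $!" if !defined($n);
    last if !$n;

    $data =~ s/$re//g;    # Only the substitution runs per chunk.
    $output .= $data;
}
close($fh);

# Show the surviving bytes as hex.
print join( " ", map { sprintf "%02X", $_ } unpack( 'C*', $output ) ), "\n";
# prints "61 0A 63 64 61 0A 63 64 61 0A 63 64"
```

Because bytes are removed one at a time, it doesn't matter where the chunk boundaries fall.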

That said, if it's reasonable to hardcode the bytes to remove, tr///d will give you the best performance.

$data =~ tr/\x05\x0B-\x0D\x15\x40\x5B\x5C\x62\x6B//d;
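
As a small sanity check of the hardcoded list (the sample string is made up for the demo), note that tr/// returns the number of characters it deleted, which can be handy for logging:

```perl
use strict;
use warnings;

# Hypothetical sample: "a", byte 5, "b" (98), "[" (91), newline, "c", "@" (64), "d".
my $data = "a\x05b[\nc\@d";

# Same byte list as above, written in hex; \x0B-\x0D covers 11, 12 and 13.
my $removed = ( $data =~ tr/\x05\x0B-\x0D\x15\x40\x5B\x5C\x62\x6B//d );

print "$removed bytes removed\n";    # prints "4 bytes removed"
```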

It's not effective to use tr/// from a dynamic list because tr/// doesn't interpolate. We have to resort to building a sub, and invoking a sub is relatively slow.

my @to_strip = ( 5, 11, 12, 13, 21, 64, 91, 92, 98, 107 );
my $class = quotemeta( pack( 'C*', @to_strip ) );
my $inline_stripper = eval("sub { \$_[0] =~ tr/$class//d; }");

$inline_stripper->($data);
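
If you go the eval route, it's worth checking that the generated code actually compiled. A minimal sketch with made-up sample data; note that inside a double-quoted string, $_[0] has to be written \$_[0] so that it reaches eval literally rather than being interpolated, while $class is meant to be interpolated.

```perl
use strict;
use warnings;

my @to_strip = ( 5, 11, 12, 13, 21, 64, 91, 92, 98, 107 );
my $class    = quotemeta( pack( 'C*', @to_strip ) );

# eval returns undef on a compile error and leaves the message in $@.
my $inline_stripper = eval("sub { \$_[0] =~ tr/$class//d; }")
    or die "Failed to compile stripper: $@";

# Hypothetical sample: "a", byte 5, "b", "[", newline, "c", "@", "d".
my $data = "a\x05b[\nc\@d";
$inline_stripper->($data);    # Modifies $data in place through the @_ alias.

print join( " ", map { sprintf "%02X", $_ } unpack( 'C*', $data ) ), "\n";
```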

The following is an efficient non-regex approach (though surely not as efficient as the solutions above).

my @to_strip = ( 5, 11, 12, 13, 21, 64, 91, 92, 98, 107 );
my @to_strip_lookup; $to_strip_lookup[$_] = 1 for @to_strip;

$data = pack 'C*', grep !$to_strip_lookup[$_], unpack 'C*', $data;
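
You can check the relative speeds on your own data with the core Benchmark module. This is only a rough sketch: the random 4 KiB sample is an assumption, real numbers depend on how many bytes match, and the wrapper sub adds the same call overhead to every candidate.

```perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

my @to_strip = ( 5, 11, 12, 13, 21, 64, 91, 92, 98, 107 );

my $pat = "[" . quotemeta( pack( 'C*', @to_strip ) ) . "]+";
my $re  = qr/$pat/;

my @to_strip_lookup;
$to_strip_lookup[$_] = 1 for @to_strip;

# Each candidate takes a string and returns the stripped copy.
my %strippers = (
    regex  => sub { my $d = shift; $d =~ s/$re//g; $d },
    tr     => sub { my $d = shift; $d =~ tr/\x05\x0B-\x0D\x15\x40\x5B\x5C\x62\x6B//d; $d },
    unpack => sub { pack 'C*', grep !$to_strip_lookup[$_], unpack 'C*', shift },
);

# Assumption: random bytes stand in for a real chunk of the file.
my $sample = join '', map chr( int rand 256 ), 1 .. 4096;

# Make sure all candidates agree before timing them.
my $expected = $strippers{regex}->($sample);
for my $name ( sort keys %strippers ) {
    die "$name disagrees with regex"
        if $strippers{$name}->($sample) ne $expected;
}

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese( -1, {
    map { my $code = $strippers{$_}; ( $_ => sub { $code->($sample) } ) }
        keys %strippers
} );
```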

edited Mar 27, 2019 at 1:48

answered Mar 26, 2019 at 19:24


9 Comments

stack0114106 Over a year ago

thank you for the answer.. will not using chr() on a binmode string change characters

ikegami Over a year ago

Re "will not using chr() on a binmode string change characters", huh? None of the solutions use chr on a string.

stack0114106 Over a year ago

just curious.. this is something like A minus B set operation, so can it be done with 2 arrays of byte values?.

ikegami Over a year ago

You could consider the list of bytes to strip a set, but not the data because the values of sets are unique and unordered. But you aren't wrong. The non-regex solution I've just added is basically exactly how I'd do a set difference (if the elements of the sets started in arrays): my %B = map { $_ => 1 } @B; my @AminusB = grep !$B{$_}, @A;
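
Spelled out as a runnable sketch (the values in @A and @B are made up for the demo), the one-liner from the comment above behaves like this:

```perl
use strict;
use warnings;

my @A = ( 97, 5, 97, 98, 10 );    # data: order and duplicates matter
my @B = ( 5, 98 );                # values to remove

my %B       = map { $_ => 1 } @B;
my @AminusB = grep !$B{$_}, @A;

print "@AminusB\n";    # prints "97 97 10"
```

The order of @A is preserved and its duplicate 97 survives, which is why the data itself isn't really a set.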

Grinnz Over a year ago

list2re from Data::Munge can nicely wrap the logic of creating the regex (though I don't know if the resulting regex will be as efficient as a character class in this case).
