Just for fun, I tried writing an in-place replacement bash script, myreplace.
Obviously, do not use this without saving your original data first and
doing extensive testing. It might have problems with files over 4 GiB,
though, as the byte offsets then no longer fit in 32 bits. Also, if there
are millions of matches, tac is going to use up memory or temporary file space.
I also had to hack up a small perl script to do an lseek(2), but there must be one somewhere already.
#!/bin/bash
# https://unix.stackexchange.com/q/784361/119298
file=${1?}
str1=utf8mb4_0900_ai_ci
str2=utf8mb4_unicode_520_ci
len1=${#str1}
len2=${#str2}
let len3=len2-len1
if [ "$len3" -lt 0 ]
then echo "bad len $len3. don't need this script"; exit 1
fi
echo "2nd str bigger by $len3"
# grep -c counts lines so ignores 2 matches on a line, not what we want
nummatches=$(grep -a -o -b -F "$str1" "$file" | wc -l)
let need=nummatches*len3
echo "$nummatches matches, need $need bytes"
filesize=$(stat --format=%s "$file")
echo "filesize $filesize"
let src=filesize
let dest=filesize+need
let i=nummatches
# open 2 file descriptors on the same file, one to read from and one to write at
exec {fdr}<"$file" {fdw}<>"$file"
seek <&$fdr $src; seek <&$fdw $dest # seek to both eofs
blocksize=10240 # arbitrary optimisation
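# move overlapping from,to,numbytes
# (from and to are informational only; the current positions of fdr and fdw are what count)
domove(){
    local from=${1?} to=${2?} numbytes=${3?} partlen
    while [ "$numbytes" -gt 0 ]
    do  if [ "$numbytes" -gt "$blocksize" ]
        then partlen=$blocksize
        else partlen=$numbytes
        fi
        # step both descriptors back one block, copy it, then step back over it again
        seek <&$fdr -$partlen; seek <&$fdw -$partlen
        dd <&$fdr >&$fdw ibs=$partlen count=1 iflag=fullblock status=none
        seek <&$fdr -$partlen; seek <&$fdw -$partlen
        let numbytes=numbytes-partlen
    done
    # write the replacement just before the moved data, then skip the old string
    seek <&$fdw -$len2
    printf "%s" "$str2" >&$fdw
    seek <&$fdw -$len2
    seek <&$fdr -$len1
}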
grep -a -o -b -F "$str1" "$file" |
sed 's/:.*//' |
tac |
while read offset
do  echo "match $i at src $offset"
    let tomove="src-(offset+len1)"
    echo "move all from $offset+$len1 .. $src ($tomove bytes) to $dest-$tomove"
    echo "insert $len2 bytes of 2nd string to $dest-$tomove-$len2"
    echo "skip back over $len1 bytes of 1st string"
    domove $(($offset+$len1)) $(($dest-$tomove)) $tomove
    let src=$offset
    let dest=dest-tomove-len2
    let i=i-1
done
The principle is to use grep to find the byte offsets of the matches,
then use tac to reverse this list so we start at the end.
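To illustrate the offset list, here is a small sketch on a throwaway file (the file contents are made up for the demo; it assumes GNU grep and tac, as the script does). Note that two of the three matches sit on the same line, which is why grep -c would undercount.

```bash
#!/bin/bash
# Hypothetical demo input: two matches on line 1, one on line 2.
str1=utf8mb4_0900_ai_ci
demo=$(mktemp)
printf 'x %s y %s\nz %s\n' "$str1" "$str1" "$str1" >"$demo"

# -o prints each match on its own line, -b prefixes its byte offset,
# sed keeps only the offset, and tac reverses the list (last match first).
offsets=$(grep -a -o -b -F "$str1" "$demo" | sed 's/:.*//' | tac)
echo "$offsets"
rm -f "$demo"
```

For this input the matches start at bytes 2, 23 and 44, so the reversed list starts with 44.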
We open 2 file descriptors on the file. fdr will be our current reading
position, and fdw our write position. They both start at the end of the
file, but fdw is at the new notional end, which is further on by
nummatches times len3, the difference in length of the replacement string.
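As a worked example of that arithmetic, using the two strings from the script and a made-up file size and match count (both are assumptions purely for illustration):

```bash
#!/bin/bash
str1=utf8mb4_0900_ai_ci        # 18 bytes
str2=utf8mb4_unicode_520_ci    # 22 bytes
len3=$(( ${#str2} - ${#str1} ))   # growth per match

nummatches=3     # assumed for illustration
filesize=1000    # assumed for illustration

need=$(( nummatches * len3 ))  # extra bytes the file grows by
src=$filesize                  # fdr starts at the current end of file
dest=$(( filesize + need ))    # fdw starts at the new notional end
echo "len3=$len3 need=$need src=$src dest=$dest"
```

With these numbers the writer starts 12 bytes beyond the reader, and the gap shrinks by 4 after each match is processed until both descriptors meet at the first match.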
We use function domove to seek back on the reader by an amount, seek back
on the writer by the same amount, read and copy the amount to the writer.
We then need to seek back again to our new positions.
We seek back in the reader to skip over the old string. On the writer we
seek back, write the replacement string, and seek back over it.
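The backwards block copy can also be tried in isolation. The sketch below is not the script's method: instead of two shared file descriptors and the seek helper, it uses a hypothetical function shift_tail that passes absolute byte offsets to GNU dd (iflag=skip_bytes, oflag=seek_bytes, conv=notrunc). The idea is the same, though: copy the last block first so that data not yet read is never overwritten.

```bash
#!/bin/bash
# shift_tail FILE FROM TO N [CHUNK]  (hypothetical helper, assumes GNU dd)
# Move N bytes starting at offset FROM to offset TO (TO > FROM) in place,
# working from the end of the range backwards, one chunk at a time.
shift_tail() {
    local f=$1 from=$2 to=$3 n=$4 chunk=${5:-4096} part
    while [ "$n" -gt 0 ]; do
        part=$(( n < chunk ? n : chunk ))
        n=$(( n - part ))          # n is now the offset of this chunk in the range
        dd if="$f" of="$f" bs="$part" count=1 \
           skip=$(( from + n )) seek=$(( to + n )) \
           iflag=skip_bytes,fullblock oflag=seek_bytes \
           conv=notrunc status=none
    done
}

demo=$(mktemp)
printf 'abcdefghij' >"$demo"
shift_tail "$demo" 3 6 7 3     # move "defghij" 3 bytes to the right, 3 bytes at a time
result=$(cat "$demo")
echo "$result"
rm -f "$demo"
```

The bytes left behind between the old and new start of the moved range (offsets 3 to 5 here) are the gap that the real script fills with the replacement string.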
I created a demo file to test with (str1 is from the script):
file=/tmp/myfile
man bash | sed 's/ brace / '"$str1"' /g' >"$file"
cp "$file" /tmp/orig
./myreplace "$file"
diff -u /tmp/orig "$file"
My perl is a bit rusty, but here's the perl script "seek":
#!/usr/bin/perl
# seek on stdin to given position
use strict;
use Fcntl 'SEEK_SET','SEEK_CUR';
sub usage{
    printf STDERR "usage: [+|-]9999 where sign means relative\n";
    exit 1;
}
my $offset = shift @ARGV;
my $flag = SEEK_SET;
my $sign = 1;
if($offset =~ s/^-//){$flag = SEEK_CUR; $sign = -1;}
elsif($offset =~ s/^\+//){$flag = SEEK_CUR;}
if($offset!~/^\d+$/){ usage(); }
usage() if(scalar @ARGV!=0);
$offset *= $sign;
if(!seek(STDIN,$offset,$flag)){
    printf STDERR "failed to seek to $offset: $!\n";
    exit 2;
}