
I have a multi-gigabyte text file and I want to replace all occurrences of utf8mb4_0900_ai_ci in it with utf8mb4_unicode_520_ci.

Usually, I’d use sed -i for this as suggested here: find and replace a string in a file without using temp file with SED

However, this creates a temp file under the hood and I need this replacement to occur in an environment that won’t have the disk space to support that.

How can I modify the file in-place?

  • Is the replacement string of the same length (in bytes) as the string to be replaced? Commented Oct 2, 2024 at 2:49
  • Nope, I've edited the question to clarify Commented Oct 2, 2024 at 8:28
  • An alternative would be to do the replacement as I unzip the file in the first place I suppose? Commented Oct 2, 2024 at 8:47
  • @ilkkachu - Could you not calculate how far it needs to move by basically doing number of instances of found string * byte delta? Commented Oct 2, 2024 at 9:53
  • If it is gzipped, please mention that in the question! Do you need to keep the file compressed or is decompressing it part of your pipeline? Can you just zcat file.gz | sed ... > uncompressed.file for example? Commented Oct 2, 2024 at 10:29

1 Answer

Just for fun, I wrote an in-place replacement bash script, myreplace. Obviously, do not use this without backing up your original data first and testing extensively. It might have problems with files over 4 GB, as the offsets then no longer fit in 32 bits. Also, if there are millions of matches, tac is going to use up memory or temporary file space. I also had to hack up a small Perl script, seek, to do an lseek(2) on an open file descriptor, though a ready-made tool for this probably exists somewhere.

#!/bin/bash
# https://unix.stackexchange.com/q/784361/119298
file=${1?}
str1=utf8mb4_0900_ai_ci
str2=utf8mb4_unicode_520_ci

len1=${#str1}
len2=${#str2}
let len3=len2-len1
if [ "$len3" -lt 0 ]
then echo "bad len $len3. don't need this script"; exit 1
fi
echo "2nd str bigger by $len3"

# grep -c counts matching lines, so it would miss a 2nd match on the same line;
# count the lines printed by grep -o instead (one per match)
nummatches=$(grep -a -o -b -F "$str1" "$file" | wc -l)
let need=nummatches*len3
echo "$nummatches matches, need $need bytes"
filesize=$(stat --format=%s "$file")
echo "filesize $filesize"
let src=filesize
let dest=filesize+need
let i=nummatches

# open 2 filedescriptors on same file, to read from and write at
exec {fdr}<"$file" {fdw}<>"$file"
seek <&$fdr $src; seek <&$fdw $dest # seek to both eofs

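blocksize=10240 # arbitrary optimisation
# domove from to numbytes: copy numbytes backwards between two overlapping ranges.
# The copy works from the current fdr/fdw positions; from and to are informational only.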
domove(){
    local from=${1?} to=${2?} numbytes=${3?} partlen
    while [ $numbytes -gt 0 ]
    do  if [ $numbytes -gt $blocksize ]
        then    partlen=$blocksize
        else    partlen=$numbytes
        fi
        seek <&$fdr -$partlen; seek <&$fdw -$partlen
        dd <&$fdr >&$fdw ibs=$partlen count=1 iflag=fullblock status=none
        seek <&$fdr -$partlen; seek <&$fdw -$partlen
        let numbytes=numbytes-partlen
    done
    seek <&$fdw -$len2
    printf "%s" "$str2" >&$fdw
    seek <&$fdw -$len2
    seek <&$fdr -$len1
}

grep -a -o -b -F "$str1" "$file" |
sed 's/:.*//' |
tac |
while read -r offset
do  echo "match $i at src $offset"
    let tomove="src-(offset+len1)"
    echo "move all from $offset+$len1 .. $src ($tomove bytes) to $dest-$tomove"
    echo "insert $len2 bytes of 2nd string to $dest-$tomove-$len2"
    echo "skip back over $len1 bytes of 1st string"
    domove $(($offset+$len1)) $(($dest-$tomove)) $tomove
    let src=$offset
    let dest=dest-tomove-len2
    let i=i-1
done

The principle is to use grep to find the byte offsets of the matches, then use tac to reverse this list so that we start at the end of the file. We open two file descriptors on the file: fdr holds our current read position and fdw our write position. Both start at the end of the file, but fdw starts at the new notional end, which is further on by nummatches times len3, where len3 is the difference in length between the two strings.
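To make the arithmetic concrete, here is a small sketch that only computes the numbers and modifies nothing. The three-match demo file is made up for illustration, and GNU grep and tac are assumed, as in the script.

```shell
#!/bin/bash
# Sketch: compute the reversed match offsets and the extra space needed.
# The demo file below is hypothetical; nothing is modified in place.
str1=utf8mb4_0900_ai_ci
str2=utf8mb4_unicode_520_ci
demo=$(mktemp)
printf 'a %s b\nc %s d %s e\n' "$str1" "$str1" "$str1" >"$demo"

delta=$(( ${#str2} - ${#str1} ))   # growth per match, in bytes (ASCII strings)
# Byte offset of every match, last match first
offsets=$(grep -a -o -b -F "$str1" "$demo" | cut -d: -f1 | tac)
nummatches=$(( $(printf '%s\n' "$offsets" | wc -l) ))
filesize=$(( $(wc -c <"$demo") ))
newsize=$(( filesize + nummatches * delta ))

echo "delta=$delta matches=$nummatches"
echo "old size=$filesize new size=$newsize"
echo "offsets, last first:" $offsets
rm -f "$demo"
```

For this demo file the strings differ by 4 bytes and there are 3 matches, so the file grows from 67 to 79 bytes, and the offsets are visited in the order 46, 25, 2.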

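The function domove seeks back on the reader by the size of a block, seeks back on the writer by the same amount, and copies that block from the reader to the writer. Because the copy advances both positions, we then seek back again by the same amount, so each descriptor ends up at the start of the block just copied. Copying the last block first means no data is overwritten before it has been read.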

We seek back in the reader to skip over the old string. On the writer we seek back, write the replacement string, and seek back over it.
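The same end-to-start shifting can also be sketched without the seek helper. This variant is not the script above: it swaps the shared file descriptors for GNU dd's byte-offset flags (skip_bytes and seek_bytes, so GNU coreutils 8.16 or later is assumed). The function name replace_inplace is hypothetical, and the strings are assumed to be ASCII so that ${#...} counts bytes. It starts a dd process per block, so treat it as an illustration of the idea rather than something tuned for multi-gigabyte files.

```shell
#!/bin/bash
# Sketch only, assuming GNU grep, tac, stat and dd.
# replace_inplace FILE OLD NEW   (hypothetical name; NEW must not be shorter than OLD)
replace_inplace() {
    local file=$1 old=$2 new=$3 bs=10240
    local delta=$(( ${#new} - ${#old} ))
    [ "$delta" -ge 0 ] || return 1
    local offsets n src dest offset tomove part
    # Byte offsets of all matches, last match first
    offsets=$(grep -a -o -b -F -e "$old" "$file" | cut -d: -f1 | tac)
    [ -n "$offsets" ] || return 0
    n=$(( $(printf '%s\n' "$offsets" | wc -l) ))
    src=$(stat --format=%s "$file")   # end of the data not yet moved
    dest=$(( src + n * delta ))       # end of the space not yet filled
    for offset in $offsets; do
        tomove=$(( src - (offset + ${#old}) ))
        # Copy the tail that follows this match, last block first
        while [ "$tomove" -gt 0 ]; do
            part=$(( tomove > bs ? bs : tomove ))
            src=$(( src - part )); dest=$(( dest - part )); tomove=$(( tomove - part ))
            dd if="$file" of="$file" bs="$part" count=1 conv=notrunc status=none \
               iflag=skip_bytes,fullblock skip="$src" oflag=seek_bytes seek="$dest"
        done
        # Write the replacement just before the tail that was moved
        dest=$(( dest - ${#new} ))
        printf '%s' "$new" |
            dd of="$file" bs="${#new}" conv=notrunc status=none \
               iflag=fullblock oflag=seek_bytes seek="$dest"
        src=$offset                   # skip back over the old string
    done
}

# Usage on a throwaway file, checked against what sed produces from a copy
demo=$(mktemp)
printf 'a %s b\nc %s d %s e\n' utf8mb4_0900_ai_ci utf8mb4_0900_ai_ci utf8mb4_0900_ai_ci >"$demo"
cp "$demo" "$demo.orig"
replace_inplace "$demo" utf8mb4_0900_ai_ci utf8mb4_unicode_520_ci
sed 's/utf8mb4_0900_ai_ci/utf8mb4_unicode_520_ci/g' "$demo.orig" | cmp - "$demo" && echo same
```

conv=notrunc matters here: without it dd would truncate the output file at the seek position and destroy the data that has already been moved.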

I created a demo file to test with (str1 has the same value as in the script):

file=/tmp/myfile
str1=utf8mb4_0900_ai_ci
man bash | sed 's/ brace / '"$str1"' /g' >"$file"
cp "$file" /tmp/orig
./myreplace "$file"
diff -u /tmp/orig "$file"

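My Perl is a bit rusty, but here's the Perl script "seek". It needs to be executable and on your PATH under that name, since the bash script calls it as seek: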

#!/usr/bin/perl
# seek on stdin to given position
use strict;
use Fcntl 'SEEK_SET','SEEK_CUR';
sub usage{
    printf STDERR "usage: [+|-]9999  where sign means relative\n";
    exit 1;
}
my $offset = shift @ARGV;
my $flag = SEEK_SET;
my $sign = 1;
if($offset =~ s/^-//){$flag = SEEK_CUR; $sign = -1;}
elsif($offset =~ s/^\+//){$flag = SEEK_CUR;}
if($offset!~/^\d+$/){ usage(); }
usage() if(scalar @ARGV!=0);
$offset *= $sign;
if(!seek(STDIN,$offset,$flag)){
    print STDERR "failed to seek to $offset: $!\n";
    exit 2;
}
