
How can I replace multiple strings in one big file (500K+ lines) using a mapping file (50K+ lines)? The mapping file is structured like this:

A1  B1
A2  B2
A3  B3
..  ..

The big file is structured like this:

A1  A2
A1  A3
A1  A8
A2  A1
A2  A3
A3  A10
A3  A13

Every string in the big file has to be replaced using the mapping file.

Desired result:

B1  B2
B1  B3
B1  B8
B2  B1
B2  B3
B3  B10
B3  B13

I tried running awk once for every line of the mapping file, but it takes a very long time. I wrote a loop that launches one awk command per mapping line, saves the result in a temporary file, and feeds that file to the next awk command together with the next mapping line (not very efficient, I know). Here is the awk command for a single mapping:

cat inputBigFile.txt | awk '{ gsub( "A1","B1" );}1' > out.txt

Thanks in advance

  • Precisely what awk command did you try that was too slow? Commented Apr 23, 2014 at 7:19
  • Search for one of the many near-duplicates where the answer explains how to use NR==FNR. Commented Apr 23, 2014 at 7:35
  • Anyway, you should not cat data to programs that can read it themselves, like awk: awk '{ gsub( "A1","B1" );}1' inputBigFile.txt > out.txt. To see how long a program takes, prefix it with time, e.g. time awk 'code' file > out. Commented Apr 23, 2014 at 7:39

1 Answer

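$ awk 'NR==FNR{map[$1]=$2;next} {if($1 in map)$1=map[$1]; if($2 in map)$2=map[$2]}1' mappings file
B1 B2
B1 B3
B1 A8
B2 B1
B2 B3
B3 A10
B3 A13

(A8, A10 and A13 are left unchanged here because the sample mappings file only contains entries for A1 to A3.)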
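
For readers who want to try the idea on a small scale, here is a minimal sketch of the same two-file approach, spread over several lines. The sample data and the temporary directory are made up for illustration; only the file names mappings and file follow the command above.

```shell
#!/bin/sh
# Hypothetical sample data in a scratch directory (not the asker's real files).
workdir=$(mktemp -d)
printf 'A1 B1\nA2 B2\nA3 B3\n' > "$workdir/mappings"
printf 'A1 A2\nA1 A8\nA3 A1\n' > "$workdir/file"

# While reading the first file, NR == FNR holds, so every line is stored
# in the lookup table and skipped. For the second file, each of the two
# columns is replaced only if it has an entry in the table.
result=$(awk '
    NR == FNR { map[$1] = $2; next }
    {
        if ($1 in map) $1 = map[$1]
        if ($2 in map) $2 = map[$2]
        print
    }
' "$workdir/mappings" "$workdir/file")

printf '%s\n' "$result"
```

The big file is read only once and each lookup is a single array access, which is why this should scale far better than launching one awk process per mapping line.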

I assume that checking and replacing the two columns explicitly is faster than looping over all NF fields or using gsub.

EDIT: It is significantly faster:

$ wc -l file
8388608 file


$ time awk 'NR==FNR{map[$1]=$2;next} {if($1 in map)$1=map[$1]; if ($2 in map)$2=map[$2]}1' mappings file >/dev/null
real    0m6.941s
user    0m6.904s
sys     0m0.016s


$ time awk 'NR==FNR{map[$1]=$2;next} {for(i=1;i<=NF;i++)$i=($i in map)?map[$i]:$i}1' mappings file >/dev/null
real    0m10.311s
user    0m10.249s
sys     0m0.036s


$ awk --version | head -n 1
GNU Awk 3.1.8

3 Comments

Thanks, that's perfect. But you forgot a $ in your awk command (in $2=map[2]): awk 'NR==FNR{map[$1]=$2;next} {if($1 in map)$1=map[$1]; if($2 in map)$2=map[$2]}1' mappings file
One thing: how can I force tab-delimited output?
@NicoBxl Use awk -v OFS='\t' [...] to set the output delimiter. However, this only affects lines where at least one of the columns has been changed. To force it on every line, you can add an explicit print $1, $2 after the if statements (and drop the trailing 1), e.g. [...]; if($2 in map)$2=map[$2]; print $1, $2}.
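
A small sketch of the tab-delimited variant described in the comment above, again with made-up sample data in a scratch directory. The second input line contains no mapped value, to show that the explicit print still applies the tab separator to it.

```shell
#!/bin/sh
# Hypothetical sample data (not the asker's real files).
tmp=$(mktemp -d)
printf 'A1 B1\n' > "$tmp/mappings"
printf 'A1 A2\nA8 A9\n' > "$tmp/file"

# OFS is set to a tab; print $1, $2 joins the two fields with OFS on
# every line, whether or not a replacement took place.
tabbed=$(awk -v OFS='\t' '
    NR == FNR { map[$1] = $2; next }
    {
        if ($1 in map) $1 = map[$1]
        if ($2 in map) $2 = map[$2]
        print $1, $2
    }
' "$tmp/mappings" "$tmp/file")

printf '%s\n' "$tabbed"
```

Note that this form assumes exactly two columns per line; any further fields would be dropped by print $1, $2.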
