Replace text with values

Question

I have two files that need to be merged into one.

File 1 example:

gene_1  578
gene_2  565
gene_3  3
gene_4  77
gene_5  8
gene_6  0
gene_7  45
gene_8  67
gene_9  0
gene_10 65

File 2 example:

COG0430 gene_5 gene_9
COG1949 gene_1 gene_3 gene_6
COG5049 gene_2 gene_4 gene_7 gene_10
COG5104 gene_8

The output file should look like this:

COG0430 8 0
COG1949 578 3 0
COG5049 565 77 45 65
COG5104 67

Does anyone know a command that can solve this problem?

Vi Pau · Accepted Answer · 2018-07-06 13:53:08Z

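#!/bin/bash
declare -A arr
readarray -t lines < "file1"

# Build a lookup table: gene name -> value.
# read splits on runs of whitespace, so double spaces between columns are fine.
for line in "${lines[@]}"; do
    read -r gene value <<< "$line"
    [[ -n $gene ]] && arr[$gene]=$value
done

readarray -t lines2 < "file2"

# Keep the COG name, then replace each gene name with its value.
for line in "${lines2[@]}"; do
    read -r -a words <<< "$line"
    out=${words[0]}
    for word in "${words[@]:1}"; do
        out+=" ${arr[$word]}"
    done
    echo "$out"
done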

Not the cleanest bash, but it works. Note that associative arrays (declare -A) and readarray require bash 4.0 or later.

steve · Accepted Answer · 2018-07-06 13:52:46Z

Here's one way

awk '/^gene/{a[$1]=$2}/^COG/{c=$1;for(b=1;b<=NF;b++){c=sprintf("%s%s%c",c,a[$b],b==NF?"":" ")}print c}' file1 file2
COG0430 8 0
COG1949 578 3 0
COG5049 565 77 45 65
COG5104 67

/^gene/{a[$1]=$2} matches lines that start with "gene" and stores an entry in array a, keyed by the first column (e.g. "gene_1"), with the second column as its value (e.g. "578").
/^COG/ matches lines that start with "COG".
c=$1 sets variable c to the first column, e.g. "COG0430".
c=sprintf("%s%s%c",c,a[$b],b==NF?"":" ") appends the array entry for each column to c, followed by a space delimiter unless it is the last column. For b=1 the lookup a[$1] is empty, so only the space after the COG name is added.
print c then prints the fully formed variable c.
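
The same logic can also be spread over several lines, which may be easier to follow. This is a sketch rather than the exact command above: it starts the loop at the second field, so the COG name is never looked up, and it joins values with plain string concatenation instead of sprintf. The scratch directory and the variable names tmp and result are only illustrative.

```shell
#!/bin/sh
# Sample inputs from the question, written to a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 'gene_1  578' 'gene_2  565' 'gene_3  3' 'gene_4  77' 'gene_5  8' \
    'gene_6  0' 'gene_7  45' 'gene_8  67' 'gene_9  0' 'gene_10 65' > "$tmp/file1"
printf '%s\n' 'COG0430 gene_5 gene_9' 'COG1949 gene_1 gene_3 gene_6' \
    'COG5049 gene_2 gene_4 gene_7 gene_10' 'COG5104 gene_8' > "$tmp/file2"

result=$(awk '
    /^gene/ { a[$1] = $2 }            # file1 lines: remember gene -> value
    /^COG/  {
        c = $1                        # start the output line with the COG name
        for (b = 2; b <= NF; b++)     # fields 2..NF are gene names
            c = c " " a[$b]           # append a space and the looked-up value
        print c
    }
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Because the space is added before each value rather than after it, no special case is needed for the last column.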

ctac_ · Accepted Answer · 2018-07-06 16:50:20Z

You can try this awk script:

awk '
  NR == FNR {
    a[$1] = $2
    next
  }
  {
    for ( i = 2 ; i <= NF ; i++)
    $i = a[$i]
  }
1' file1 file2

Or on one line:

awk 'NR==FNR{a[$1]=$2;next}{for(i=2;i<=NF;i++)$i=a[$i]}1' file1 file2
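
One thing to be aware of with this approach: a gene that appears in file2 but not in file1 is replaced by an empty string. A possible variant, sketched below, substitutes a placeholder instead. The placeholder NA and the extra gene name gene_99 are assumptions made for this illustration and are not part of the question; the scratch directory and the result variable are likewise only illustrative.

```shell
#!/bin/sh
# Small inputs: gene_99 is deliberately missing from file1.
tmp=$(mktemp -d)
printf '%s\n' 'gene_5  8' 'gene_9  0' > "$tmp/file1"
printf '%s\n' 'COG0430 gene_5 gene_9 gene_99' > "$tmp/file2"

result=$(awk '
    NR == FNR { a[$1] = $2; next }          # first file: gene -> value
    {
        for (i = 2; i <= NF; i++)           # second file: fields 2..NF
            $i = ($i in a) ? a[$i] : "NA"   # known gene: value, otherwise NA
    }
    1                                       # print the rebuilt line
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$result"
rm -rf "$tmp"
```

The test ($i in a) checks for the key without creating it, which a plain a[$i] lookup would do as a side effect.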

Rakesh Sharma · Accepted Answer · 2018-07-08 07:57:08Z

 perl -ale '
    $h{$F[0]}=$F[1],next if @ARGV;
    my $k;
    print s/\H+/$k++ ? $h{$&} : $&/reg;
 ' file1 file2

° While the first file is being read, @ARGV still holds the second file name and is therefore true.

° Populate a hash %h with the gene names as keys and the second field as the corresponding values, for each line of file1.

° For the second file, @ARGV is empty and therefore false, so the last two lines of code run for each line of file2.

° The counter $k is reset each time a line of file2 is read. \H+ matches a run of characters that are not horizontal whitespace, in other words a field. The first field (the COG name) is kept as is; from the second field onward, each gene name is replaced by its value.

sed can also do it, using GNU extensions:

 sed -Ee '
     # store file1 in hold
    /^C/!{H;1h;d;}

    # place a traveling marker \n\n at $2
    s/$/ /
    G
    s/(\S+\s+)/&\n\n/

    # effect gene name => gene number 
    :a
       s/\n\n(\S+)[ ]+((.*\n)?\1\s+([0-9]+))/ \4\n\n\2/
    ta

   # take away marker and hold portion
    s/\n\n.*//
 ' file1 file2

Older perls don't imply -n from -a, so it must be specified explicitly. Since we already need -a, for the second file I'd use print shift @F,map{" ".$h{$_}} @F
– dave_thompson_085, commented Jul 8, 2018 at 16:11
