1

I have two separate files, each containing a different number of columns which I want to merge based on data in multiple columns.

file1

VMNF01000015.1  1769465 1769675 .   .   -   Focub_II5_mimp_1
VMNF01000014.1  3225875 3226081 .   .   +   Focub_II5_mimp_1
VMNF01000014.1  3226046 3226081 .   .   -   Focub_II5_mimp_1
VMNF01000014.1  3585246 3585281 .   .   -   Focub_II5_mimp_1
VMNF01000014.1  3692468 3692503 .   .   -   Focub_II5_mimp_1
VMNF01000014.1  3715380 3715415 .   .   +   Focub_II5_mimp_1
VMNF01000014.1  2872478 2872511 .   .   -   Focub_II5_mimp_1

file2

VMNF01000014.1  3225875-3226081(+)  gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1  3226046-3226081(-)  tacacacctgcgaatactttttgcatcccactgta
VMNF01000015.1  1769465-1769675(-)  gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1  3692468-3692503(-)  tacagtgggatgcaaaaagtattcgcaggtgt
VMNF01000014.1  3715380-3715415(+)  gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1  3585246-3585281(-)  tacagtgggatgcaaaaagtattcgcaggtgt
VMNF01000014.1  2872478-2872511(-)  gtacttcagcctggattcaaacttattgcatcccactgta

First, I think I need to create another 2 columns in file2, separating numbers by "-" and creating a new column for "(*)", but I cannot work out how to separate the numbers without replacing "(-)" too. So far I have been using this command:

awk '{gsub("-","\t",$2);print;}'

Once this has been done, I would like to add the last column in file2 to file1. I have been able to do this using the following command:

awk 'NR==FNR {a[$1]=$3; next} {print $1,$2,$3,$4,$5,$6,$7,a[$1];}' file2 file1 > file3. 

However, the data does not match. It is matched based on the entry in column 1. The data in column 1 is the same in many instances, so the data in column 8 of file3 only matches one of the entries, and doesn't match the data in column 2 or 3 in file1 e.g.

file3:

VMNF01000015.1  1769465 1769675 .   .   -   Focub_II5_mimp_1    gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1  3225875 3226081 .   .   +   Focub_II5_mimp_1    gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1  3226046 3226081 .   .   -   Focub_II5_mimp_1    gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1  3585246 3585281 .   .   -   Focub_II5_mimp_1    gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1  3692468 3692503 .   .   -   Focub_II5_mimp_1    gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1  3715380 3715415 .   .   +   Focub_II5_mimp_1    gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1  2872478 2872511 .   .   -   Focub_II5_mimp_1    gtacttcagcctggattcaaacttattgcatcccactgta

Even if I was able to separate the data in column 2 of file2, I would still have the same problem as the data in column 2 is the same in some instances. What I need is code that says something along the lines of: sperate the data in column 2 (see below);

VMNF01000014.1  3225875    3226081    (+)   gtacttcagcctggattcaaacttattgcatcccactgta

then:

if $1,$2,$3 in file1 match $1,$2,$3 in file2, print $1,$2,$3,$4,$5,$6,$7 from file1 and add $5 from file2.

How can I do this? I know that awk can use if statements, but I don't know how to use them in awk.

Any advice?

2 Answers 2

2

Could you please try following.

awk '
FNR==NR{
  split($2,array,"[-(]")
  mainarray[$1,array[1],array[2]]=$NF
  next
}
(($1,$2,$3) in mainarray){
  print $0,mainarray[$1,$2,$3]
}
'  Input_file2  Input_file1

2nd solution: Since OP is getting an error in above code so made a little change in above.

awk '
FNR==NR{
  split($2,array,"[-(]")
  key=$1 OFS array[1] OFS array[2]
  mainarray[key]=$NF
  next
}
{ key = $1 OFS $2 OFS $3 }
(key in mainarray){
  print $0,mainarray[key]
}
'  Input_file2  Input_file1

Explanation: Adding detailed explanation for above code.

awk '                                       ##Starting awk program from here.
FNR==NR{                                    ##Checking condition FNR==NR when  Input_file2 is being read.
  split($2,array,"[-(]")                    ##Splitting 2nd field into an array named array where delimiter is - OR (
  mainarray[$1,array[1],array[2]]=$NF       ##Creating mainarray index of $1,array[1],array[2] and value is current line is last field.
  next                                      ##next will skip all further statements from here.
}
(($1,$2,$3) in mainarray){                  ##Checking condition if $1,$2,$3 of current line is present in mainaarray.
  print $0,mainarray[$1,$2,$3]              ##Printing current line with value of mainarray with index of $1,$2,$3
}
'  Input_file2  Input_file1                 ##Mentioning Input_file names here.
Sign up to request clarification or add additional context in comments.

1 Comment

I have noticed that in some instances the (-) or (+) are also required for ensuring the correct sequence in column 8, is there a way to include this too? Can it be done by adding another array to key?
2
$ awk '
    { key=$1 OFS $2 OFS $3 }
    NR==FNR { map[key]=$NF; next }
    { print $0, map[key] }
' FS='[[:space:](-]+' file2 FS=' ' file1
VMNF01000015.1  1769465 1769675 .   .   -   Focub_II5_mimp_1 gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1  3225875 3226081 .   .   +   Focub_II5_mimp_1 gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1  3226046 3226081 .   .   -   Focub_II5_mimp_1 tacacacctgcgaatactttttgcatcccactgta
VMNF01000014.1  3585246 3585281 .   .   -   Focub_II5_mimp_1 tacagtgggatgcaaaaagtattcgcaggtgt
VMNF01000014.1  3692468 3692503 .   .   -   Focub_II5_mimp_1 tacagtgggatgcaaaaagtattcgcaggtgt
VMNF01000014.1  3715380 3715415 .   .   +   Focub_II5_mimp_1 gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1  2872478 2872511 .   .   -   Focub_II5_mimp_1 gtacttcagcctggattcaaacttattgcatcccactgta

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.