Extracting multiple strings and combining to .csv in bash

Question

I have a long list of ID's I need to parse. I want to extract three pieces of information and write to a 3 column CSV. Column 1 = the field between tr|XXXX|, column 3 = the field after the second | but before OS=.

Column 2 would be conditional. If there is 'GN=XXX' in the line, I'd like it to return XXX. If GN= isn't present, I'd like to write the first section of column 3 (i.e. up to the first space).

Input:

>tr|I1WXP1|I1WXP1_9EURY Methyl coenzyme M reductase subunit A (Fragment) OS=uncultured euryarchaeote OX=114243 GN=mcrA PE=4 SV=1
>tr|A0A059VAR9|A0A059VAR9_9EURY V-type ATP synthase beta chain (Fragment) OS=Halorubrum sp. Ga66 OX=1480727 GN=atpB PE=3 SV=1
>tr|Q51760|Q51760_9EURY Glutaredoxin-like protein OS=Pyrococcus furiosus OX=2261 PE=1 SV=1

Desired output:

I1WXP1,mcrA,I1WXP1_9EURY Methyl coenzyme M reductase subunit A (Fragment)
A0A059VAR9,atpB, A0A059VAR9_9EURY V-type ATP synthase beta chain (Fragment)
Q51760,Q51760_9EURY,Q51760_9EURY Glutaredoxin-like protein

I can get the first two with awk, for e.g.:

awk '{split($0,a,"|"); print a[2]

But I can't work out the conditional, or how to act on the 'GN=' pattern neatly.

So for example, extracting bold text:

tr|**I1WXP1**|**I1WXP1_9EURY Methyl coenzyme M reductase subunit A (Fragment)** OS=uncultured euryarchaeote OX=114243 GN=**mcrA** PE=4 SV=1

Becomes:

I1WXP1, mcrA, I1WXP1_9EURY Methyl coenzyme M reductase subunit A(Fragment)

Ed Morton · Accepted Answer · 2022-07-28 13:17:41Z

Whenever your input contains tag=value pairs as yours does I find it best to first create an array to contain that mapping and then you can access the values by their tags (names) however you like, e.g. using any awk:

$ cat tst.awk
BEGIN { FS="[|]"; OFS="," }
{
    delete tag2val

    description = $3; sub(/ +[^ ]+=.*/,"",description)
    assignments = substr($3,length(description)+1)

    tag2val["GN"] = description; sub(/ .*/,"",tag2val["GN"])

    split(assignments,a," ")
    for ( i in a ) {
        tag = a[i]; sub(/=.*/,"",tag)
        val = substr(a[i],length(tag)+2)
        tag2val[tag] = val
    }

    print $2, tag2val["GN"], description
}

$ awk -f tst.awk file
I1WXP1,mcrA,I1WXP1_9EURY Methyl coenzyme M reductase subunit A (Fragment)
A0A059VAR9,atpB,A0A059VAR9_9EURY V-type ATP synthase beta chain (Fragment)
Q51760,Q51760_9EURY,Q51760_9EURY Glutaredoxin-like protein

With that approach if you want to print or test other fields it's trivial, e.g.:

$ cat tst.awk
BEGIN { FS="[|]"; OFS="," }
{
    delete tag2val

    description = $3; sub(/ +[^ ]+=.*/,"",description)
    assignments = substr($3,length(description)+1)

    tag2val["GN"] = description; sub(/ .*/,"",tag2val["GN"])

    split(assignments,a," ")
    for ( i in a ) {
        tag = a[i]; sub(/=.*/,"",tag)
        val = substr(a[i],length(tag)+2)
        tag2val[tag] = val
    }

    print $2, tag2val["GN"], tag2val["OS"], tag2val["PE"], description
}

$ awk -f tst.awk file
I1WXP1,mcrA,uncultured,4,I1WXP1_9EURY Methyl coenzyme M reductase subunit A (Fragment)
A0A059VAR9,atpB,Halorubrum,3,A0A059VAR9_9EURY V-type ATP synthase beta chain (Fragment)
Q51760,Q51760_9EURY,Pyrococcus,1,Q51760_9EURY Glutaredoxin-like protein

M. Nejat Aydin · Accepted Answer · 2022-07-28 13:41:21Z

0

As an alternative in sed:

sed -e '/.*tr|\([^|]*\)|\(.*\) OS=.* GN=\([^ ]*\).*/{s//\1,\3,\2/;b;}' \
    -e 's/.*tr|\([^|]*\)|\(\([^ ]*\).*\) OS=.*/\1,\3,\2/'              \
file

answered Jul 28, 2022 at 13:41

M. Nejat Aydin

10.3k1 gold badge10 silver badges22 bronze badges

Comments

potong · Accepted Answer · 2022-07-28 15:07:57Z

0

This might work for you (GNU sed):

sed -En 's/^[^|]*\|([^|]*)\|(.*) OS.*GN=(\S+).*/\1,\3,\2/p;t
         s/^[^|]*\|([^|]*)\|((\S+).*) OS.*/\1,\3,\2/p' file

Use pattern matching and back references to format the desired result.

If the first match fails use the second.

If neither matches no output.

answered Jul 28, 2022 at 15:07

potong

59.3k6 gold badges55 silver badges92 bronze badges

Comments

ufopilot · Accepted Answer · 2022-07-28 18:42:50Z

GNU Awk

awk '
    BEGIN{FS="|"; OFS=","}
    /GN=/ { gn=gensub(/.* GN=([^ ]*) .*/,"\\1",1) }
    !/GN=/ { gn=gensub(/ .*/,"",1,$3) }
    { sub(/ OS=.*$/,""); print $2,gn,$3 }
' file

I1WXP1,mcrA,I1WXP1_9EURY Methyl coenzyme M reductase subunit A (Fragment)
A0A059VAR9,atpB,A0A059VAR9_9EURY V-type ATP synthase beta chain (Fragment)
Q51760,Q51760_9EURY,Q51760_9EURY Glutaredoxin-like protein

awk '
    BEGIN{FS="|"; OFS=","}
    match($3,/(.*) OS=.* GN=([^ ]*) .*/,a) {
        print $2,a[2],a[1] 
        next
    }
    match($3,/^([^ ]*) (.*) OS=.*/,a){
        print $2,a[1],a[1]" "a[2]
    }
' file

I1WXP1,mcrA,I1WXP1_9EURY Methyl coenzyme M reductase subunit A (Fragment)
A0A059VAR9,atpB,A0A059VAR9_9EURY V-type ATP synthase beta chain (Fragment)
Q51760,Q51760_9EURY,Q51760_9EURY Glutaredoxin-like protein

RavinderSingh13 · Accepted Answer · 2022-07-29 12:46:13Z

1st solution(GNU awk version): With your shown samples, please try following awk code. Written and tested with shown samples in GNU awk.

awk -F'^>tr\\|| OS=' '
BEGIN{ OFS="," }
NF>=2{
  gsub(/\|/,OFS,$2)
  match($0,/GN=(\S+)/,gnArray)
  if(gnArray[1]==""){
    match($2,/(^[^,]*),(\S+)(.*)/,NoGNfoundArr)
    val=NoGNfoundArr[1] OFS NoGNfoundArr[2] OFS NoGNfoundArr[2]  NoGNfoundArr[3]
  }
  else{
    match($2,/(^[^,]*),(.*)/,GNFoundArr)
    val=GNFoundArr[1] OFS gnArray[1] OFS GNFoundArr[2]
  }
  print val
}
'  Input_file

2nd solution(with any awk version): Adding additional awk code which should work with any version of it.

awk '
BEGIN{ OFS="," }
match($0,/^>tr.* OS=/){
  val1=substr($0,RSTART+4,RLENGTH-8)
  gsub(/\|/,",",val1)
  match($0,/GN=[^[:space:]]+/)
  val2=substr($0,RSTART+3,RLENGTH-3)
  if(val2){
    match(val1,/^[^,]*/)
    print substr(val1,RSTART,RLENGTH),val2 substr(val1,RSTART+RLENGTH)
  }
  else{
    match(val1,/,[^[:space:]]+/)
    print substr(val1,1,RSTART-1),substr(val1,RSTART+1,RLENGTH-1),substr(val1,RSTART+1,RLENGTH-1) substr(val1,RSTART+RLENGTH)
  }
}
'  Input_file

Collectives™ on Stack Overflow

Extracting multiple strings and combining to .csv in bash

5 Answers 5

Comments

Comments

Comments

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

5 Answers 5

Comments

Comments

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related