
I have a tab-delimited data frame whose final column contains nested information that is '|'-delimited. Every row keeps this nested parenthetical structure, preceded by 'REP='.

col1    col2    col3    col4
ID1     text    text    text...REP=(info1|info2|info3)
ID2     text    text    text...REP=(info1|info2|info3)

I would like to process this last column such that all info inside the parenthetical is a new column:

col1    col2    col3    col4   newcol    newcol2    newcol3
ID1     text    text    text   info1     info2      info3
ID2     text    text    text   info1     info2      info3

I would think that an AWK command would be useful but am having trouble structuring this appropriately. Any help would be much appreciated.

  • Are those dots before REP really there, or does that represent more columns? Commented Sep 30, 2016 at 17:22
  • the ... represent additional text in col4 that occurs prior to 'REP=' Commented Sep 30, 2016 at 17:24
  • Is there a tab before "REP"? Commented Sep 30, 2016 at 17:30

3 Answers


awk to the rescue!

$ awk -v OFS='\t' 'NR==1{nh=NF; header=$0; next}
                        {v=$NF;
                         sub(/.*REP=/,"",v);
                         sub(/\.\.\.REP=.*/,"",$NF);
                         gsub(/[()]/,"",v);
                         n=split(v,vs,"|");
                         nf=NF;   # NF grows as fields are appended, so fix the base index first
                         for(i=1;i<=n;i++) $(nf+i)=vs[i]}
                   NR==2{printf "%s", header;
                         for(i=1;i<=n;i++) printf "%s", OFS "col"(nh+i);
                         print ""}1' file | column -t

col1  col2  col3  col4  col5   col6   col7
ID1   text  text  text  info1  info2  info3
ID2   text  text  text  info1  info2  info3
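
If the text before REP= can contain spaces, splitting on whitespace may miscount the fields. A minimal sketch of a tab-aware variant follows. It assumes that every data row ends with REP=(...) and that there are always three items. The file name sample.tsv and the function name split_rep are made up for illustration, and the new header names are taken from the question.

```shell
# Hypothetical sample file; col4 carries free text followed by REP=(...)
printf 'col1\tcol2\tcol3\tcol4\n' > sample.tsv
printf 'ID1\ttext\ttext\tsome text REP=(info1|info2|info3)\n' >> sample.tsv

# split_rep is an illustrative helper name, not part of any tool
split_rep() {
    awk 'BEGIN { FS = OFS = "\t" }
    NR == 1 { print $0, "newcol", "newcol2", "newcol3"; next }
    {
        rep = $NF
        sub(/.*REP=\(/, "", rep)          # keep what follows "REP=("
        sub(/\).*/, "", rep)              # drop ")" and anything after it
        sub(/[ \t]*REP=\(.*/, "", $NF)    # remove the REP=(...) part from col4
        gsub(/\|/, OFS, rep)              # turn each pipe into a tab
        print $0, rep
    }' "$1"
}

split_rep sample.tsv
```

Because the output is strictly tab-separated, it can be redirected to a file directly; column -t is only needed for viewing.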

2 Comments

Don't be in a rush to accept the answer; an upvote is fine. Perhaps a better solution will appear if you wait a little longer. I do these as speed-programming exercises without much thought.
Not a fan of the indentation style, but this is just how I implemented it.

A Perl one-liner; it doesn't modify the header, though.

$ cat ip.txt 
col1    col2    col3    col4
ID1     text    text    text REP=(info1|info2|info3)
ID2     text    text    text REP=(info1|info2|info3)

$ perl -pe 's/\s*REP=\(([^)]+)\)/"\t".$1=~tr#|#\t#r/e' ip.txt
col1    col2    col3    col4
ID1     text    text    text    info1   info2   info3
ID2     text    text    text    info1   info2   info3
  • \s*REP=\(([^)]+)\) matches zero or more whitespace characters, followed by REP=(, then a capture group of one or more characters other than ), and finally a closing )
  • the e modifier allows Perl code in the replacement section
  • $1=~tr#|#\t#r changes each | in the captured group to a tab (the r flag returns the result instead of modifying $1); the result is then appended to a string containing a tab

1 Comment

When I try to run this code I am getting an error as follows -- Bareword found where operator expected at -e line 1, near "s/\|/\t/gr" syntax error at -e line 1, near "s/\|/\t/gr " Execution of -e aborted due to compilation errors.

This leaves a trailing tab on each data line, but that can be fixed with an extra gsub.

awk 'NR==1 {print $0 "\tnewcol\tnewcol2\tnewcol3"} NR>1 {gsub(/\.\.\.REP=\(|\||\)/, "\t"); print}' input.txt
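
The trailing tab comes from replacing the closing parenthesis with a tab. One possible way around it is to delete that parenthesis rather than replace it. The sketch below assumes the three dots before REP= are literal, as in the sample rows; input.txt is recreated with one row and no_trailing_tab.tsv is an arbitrary output name.

```shell
# Recreate a small input.txt for illustration (tab-separated)
printf 'col1\tcol2\tcol3\tcol4\n' > input.txt
printf 'ID1\ttext\ttext\ttext...REP=(info1|info2|info3)\n' >> input.txt

awk 'NR == 1 { print $0 "\tnewcol\tnewcol2\tnewcol3"; next }
     { sub(/\.\.\.REP=\(/, "\t")   # marker plus opening parenthesis becomes a tab
       gsub(/\|/, "\t")            # each pipe becomes a tab
       sub(/\)$/, "")              # closing parenthesis is deleted, not replaced
       print }' input.txt > no_trailing_tab.tsv

cat no_trailing_tab.tsv
```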

1 Comment

You only need to update the header on the first line, not every line: awk 'NR==1 {print $0, "\tnewcol1..."} NR>1 {gsub(/REP .../...); print}'
