Splitting nested column to multiple columns UNIX

Question

I have a tab-delimited data frame whose final column contains nested information that is '|' delimited. Note that every row keeps this nested parenthetical structure, preceded by 'REP='.

col1    col2    col3    col4
ID1     text    text    text...REP=(info1|info2|info3)
ID2     text    text    text...REP=(info1|info2|info3)

I would like to process this last column such that all info inside the parenthetical is a new column:

col1    col2    col3    col4   newcol    newcol2    newcol3
ID1     text    text    text   info1     info2      info3
ID2     text    text    text   info1     info2      info3

I would think that an AWK command would be useful but am having trouble structuring this appropriately. Any help would be much appreciated.

Are those dots before REP really there, or do they represent more columns?
– glenn jackman, Commented Sep 30, 2016 at 17:22
The ... represents additional text in col4 that occurs prior to 'REP='.
– AMS, Commented Sep 30, 2016 at 17:24

karakfa · Accepted Answer · 2016-09-30 16:22:01Z

2

awk to the rescue!

$ awk -v OFS='\t' 'NR==1{nh=NF; header=$0; next} 
                        {v=$NF; 
                         sub(/.*REP=/,"",v);
                         sub(/\.\.\.REP=.*/,"",$NF); 
                         gsub(/[()]/,"",v); 
                         n=split(v,vs,"|"); 
                         nf=NF;  # freeze the field count: NF grows with each new field
                         for(i=1;i<=n;i++) $(nf+i)=vs[i]} 
                   NR==2{printf "%s", header; 
                         for(i=1;i<=n;i++) printf "%s", OFS "col"(nh+i); 
                         print ""}1' file | column -t

col1  col2  col3  col4  col5   col6   col7
ID1   text  text  text  info1  info2  info3
ID2   text  text  text  info1  info2  info3
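The command above splits on any whitespace and relies on column -t to realign the result. As a rough sketch only, assuming the file is strictly tab-delimited and that every data row ends with REP=(...), the same idea can be written to emit plain tab-separated output. The function name split_rep is hypothetical, and any text before REP= (the "..." in the question) is kept in col4:

```shell
# split_rep FILE: hypothetical helper, not part of the original answer.
# Expands a trailing REP=(a|b|c) into extra tab-separated columns.
split_rep() {
  awk 'BEGIN { FS = OFS = "\t" }
       NR == 1 { header = $0; nh = NF; next }  # hold the header until the column count is known
       {
         v = $NF
         sub(/.*REP=\(/, "", v)                # keep only what follows "REP=("
         sub(/\)[^)]*$/, "", v)                # drop the closing parenthesis
         sub(/ *REP=\(.*$/, "", $NF)           # strip the nested part from the last column
         n = split(v, vs, "|")
         nf = NF                               # fixed base index, since NF grows as fields are added
         for (i = 1; i <= n; i++) $(nf + i) = vs[i]
       }
       NR == 2 {                               # print the header once, sized from the first data row
         printf "%s", header
         for (i = 1; i <= n; i++) printf "%scol%d", OFS, nh + i
         print ""
       }
       { print }
       END { if (NR == 1) print header }       # header-only input: still print the header
      ' "$1"
}
```

Calling split_rep file prints the widened table; piping it through column -t is still possible for display, but only tabs separate the fields in the raw output.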

answered Sep 30, 2016 at 16:22

karakfa

karakfa Over a year ago

Don't be in a rush to accept the answer; an upvote is fine. Perhaps there will be a better solution if you wait a little longer. I do these as speed-programming exercises without much thought.

glenn jackman Over a year ago

Not a fan of the indentation style, but this is just how I implemented it.

Sundeep · Accepted Answer · 2017-12-09 15:08:48Z

1

A Perl one-liner; it doesn't modify the header, though.

$ cat ip.txt 
col1    col2    col3    col4
ID1     text    text    text REP=(info1|info2|info3)
ID2     text    text    text REP=(info1|info2|info3)

$ perl -pe 's/\s*REP=\(([^)]+)\)/"\t".$1=~tr#|#\t#r/e' ip.txt
col1    col2    col3    col4
ID1     text    text    text    info1   info2   info3
ID2     text    text    text    info1   info2   info3

\s*REP=\(([^)]+)\) matches zero or more whitespace characters, followed by REP=(, then a capture group of one or more characters other than ), and finally the closing )
The e modifier allows Perl code in the replacement section
$1=~tr#|#\t#r changes each | in the captured group to a tab; the result is then appended to a string containing a tab

edited Dec 9, 2017 at 15:08

answered Sep 30, 2016 at 16:27

Sundeep

AMS Over a year ago

When I try to run this code I am getting an error as follows -- Bareword found where operator expected at -e line 1, near "s/\|/\t/gr" syntax error at -e line 1, near "s/\|/\t/gr " Execution of -e aborted due to compilation errors.

Scott Law · Accepted Answer · 2016-09-30 17:45:58Z

0

This does leave a tab at the end, but that can be fixed with an extra gsub.

awk 'NR==1 {print $0 "\tnewcol\tnewcol2\tnewcol3"} NR>1 {gsub(/...REP=\(|\||\)/, "\t");print}' input.txt
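One way to avoid the trailing tab is to remove the closing parenthesis on its own before converting the separators. This is only a sketch under a few assumptions: the input is tab-delimited, each data row ends with ), no other column contains |, and there are always three nested items (the header names are hard-coded, as in the answer). The function name rep_no_trailing_tab is hypothetical, and text before REP= is left in col4:

```shell
# rep_no_trailing_tab FILE: hypothetical helper illustrating the extra substitution.
rep_no_trailing_tab() {
  awk 'NR == 1 { print $0 "\tnewcol\tnewcol2\tnewcol3"; next }  # extend the header once
       {
         sub(/\)$/, "")        # drop the closing parenthesis, so no trailing tab is produced
         sub(/REP=\(/, "\t")   # the nested block starts a new column
         gsub(/\|/, "\t")      # each "|" becomes a column separator
         print
       }' "$1"
}
```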

edited Sep 30, 2016 at 17:45

answered Sep 30, 2016 at 17:11

Scott Law

glenn jackman Over a year ago

You only need to update the header on the first line, not every line: awk 'NR==1 {print $0, "\tnewcol1..."} NR>1 {gsub(/REP .../...); print}'
