
I have a tab-delimited file with three columns. In each row, the 3rd column holds a string of 5 names separated by spaces (' '), but in some cases more than one space separates the names. I'd like to use a Unix bash command line to print column 1, column 2, name1, name2, name3, name4, name5, all separated by tabs.

My desired output would look like this:

avov2323[tab]rogoc232[tab]Roy[tab]Don[tab]Mike[tab]Ned[tab]Lee
cdso3432[tab]fokfd543[tab]Tom[tab]Gil[tab]Rose[tab]Dan[tab]Sam
Is there a way to store all of column 3 in a variable and then split that variable on spaces? Something like: a=awk -F "\t" '{print $3}' file.txt;awk -F " " '{print $1}' $a;

However, this command line doesn't work for me, as all the names from column 3 get crammed together in $a.
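One way to get the "store column 3 in a variable, then split it" behaviour is to read the file row by row. The following is only a minimal sketch: the sample rows are made up to mirror the desired output, and the variable names (col1, col2, names) are illustrative. Note that because tab is a whitespace character, read will collapse consecutive tabs, so this sketch assumes no column is empty.

```shell
# Hypothetical sample data mirroring the rows in the question.
tab=$(printf '\t')
input=$(printf 'avov2323\trogoc232\tRoy  Don Mike  Ned Lee\ncdso3432\tfokfd543\tTom Gil    Rose  Dan Sam')

output=$(
  printf '%s\n' "$input" |
  while IFS="$tab" read -r col1 col2 names; do
    set -f            # disable globbing so names are never expanded as patterns
    set -- $names     # unquoted on purpose: split column 3 on runs of whitespace
    set +f
    printf '%s\t%s' "$col1" "$col2"
    printf '\t%s' "$@"   # one tab-prefixed field per name
    printf '\n'
  done
)
printf '%s\n' "$output"
```

The IFS assignment applies only to read, so the later word splitting of $names still uses the default whitespace separators.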

1
  • Please show input samples, with and without insignificant spaces! Commented Oct 8, 2014 at 18:07

3 Answers

3

Use tr to translate:

tr <inputFile " " "\t" | tr -s "\t" >outputFile

Edit: As Glenn Jackman pointed out, it would be better to first squeeze spaces, then change remaining spaces to tabs.

tr <inputFile -s " " | tr " " "\t" >outputFile

It's still vulnerable to spaces in the first two columns, though.
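To see why the order of the two tr calls matters, here is a small sketch comparing both variants on a made-up row whose second column is empty. The row contents and variable names are assumptions for illustration only.

```shell
# Hypothetical row with an empty second column (two consecutive tabs).
row=$(printf 'avov2323\t\tRoy  Don Mike')

# Translate first, then squeeze tabs: the empty column is squeezed away.
translate_first=$(printf '%s\n' "$row" | tr ' ' '\t' | tr -s '\t')

# Squeeze spaces first, then translate: the empty column survives.
squeeze_first=$(printf '%s\n' "$row" | tr -s ' ' | tr ' ' '\t')

# od -c makes the tabs visible as \t.
printf '%s\n' "$translate_first" "$squeeze_first" | od -c
```

Because both variants read from standard input, the same two-stage tr pipeline can follow any command that writes rows to a pipe.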


8 Comments

Please note my edit to the question (in some cases there is more than one space between the names).
Yes, you're right. I missed that one. I've adapted my little monster.
+1 I like this. One risk: squeezing tabs will blow up when there are empty fields. You might want to squeeze spaces first, then translate spaces to tabs.
This one is a great solution as well. Is there a way to use it in a pipe, though? I create my file one step earlier and would like to pipe the result and fix its spaces and tabs as you suggest. For example, my pipe for Tom Fenech's solution is this: awk 'NR==FNR{a[$2]=$0;next} ($3) in a{print $0, a[$3]}' 13PatientsInputsGQ30DP8plinkGenoLogistic.assoc.logistic /cygdrive/h/SNPs_GATK/OnlyInputsOf13Patients/SplitChromosomes/2_CarlosSNPIDsOutput/vcfWithId.vcf | awk '{$1=$1}1' OFS='\t' > UnionLogistic_vcf.txt
Of course. Instead of providing <inputFile, put your command and a | before the first tr. Your command should then write to the pipe rather than to a file.
1

You could use awk:

$ cat file
avov2323        rogoc232        Roy  Don Mike  Ned Lee
cdso3432        fokfd543        Tom Gil    Rose  Dan Sam
$ awk '{$1=$1}1' OFS='\t' file
avov2323        rogoc232        Roy     Don     Mike    Ned     Lee
cdso3432        fokfd543        Tom     Gil     Rose    Dan     Sam

$1=$1 just touches each record so the new output format is applied. 1 evaluates to true, so each line is printed. Awk treats any number of whitespace characters as the input field separator, so as you can see, the number of spaces between each name is not a problem.

To overwrite the original file, you can use a temporary file:

awk '{$1=$1}1' OFS='\t' file > tmp && mv tmp file
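If a fixed name like tmp might clash with an existing file, mktemp can generate a unique one. This is a sketch under assumptions: mktemp is widely available but not specified by POSIX, and the working directory and file name below are invented for the demonstration.

```shell
# Hypothetical scratch directory and input file for the demonstration.
workdir=$(mktemp -d)
file="$workdir/file.txt"
printf 'avov2323\trogoc232\tRoy  Don Mike  Ned Lee\n' > "$file"

# Write to a unique temporary file, and replace the original only if awk succeeded.
tmp=$(mktemp "$workdir/tmp.XXXXXX")
awk '{$1=$1}1' OFS='\t' "$file" > "$tmp" && mv "$tmp" "$file"

result=$(cat "$file")
printf '%s\n' "$result"
rm -rf "$workdir"
```

Creating the temporary file in the same directory as the target keeps mv a simple rename on the same filesystem.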

3 Comments

thanks a million! - this one works very well for me.
Even simpler is awk -v OFS='\t' '$1=$1'. You can check the result with echo -e avov2323\\trogoc232\\tRoy Don Mike Ned Lee | awk -v OFS='\t' '$1=$1' | od -c
@Vytenis the only potential downside of using $1=$1 is that when the first column evaluates to false (for example, it is "0"), the line will not be printed. For example, awk '$1=$1' <<<"0" doesn't print anything. In this case, that doesn't seem to be a problem though.
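The pitfall described in the comment above can be reproduced with a short sketch. The input row is made up, with "0" as its first column, and the variable names are illustrative.

```shell
# Hypothetical row whose first column is "0".
# Used as a pattern, the assignment evaluates to 0 (false), so nothing is printed.
pattern_only=$(printf '0 Roy  Don\n' | awk -v OFS='\t' '$1=$1')

# Used as an action followed by the always-true pattern 1, the row is printed.
action_then_print=$(printf '0 Roy  Don\n' | awk -v OFS='\t' '{$1=$1}1')

printf 'pattern only:      [%s]\n' "$pattern_only"
printf 'action then print: [%s]\n' "$action_then_print"
```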
1

Just for the sake of completeness, I also wrote an awk one-liner that won't touch any spaces in the first two columns. It also preserves empty columns:

awk <inputFile -F '\t' 'BEGIN{OFS="\t"} {gsub(/ +/,OFS,$3); print $1,$2,$3}'

Edit: Regarding the improvement mentioned in the comments: yes, it is possible to split any column, even a middle one, though a more versatile script is necessary. It's not a one-liner, however, and it looks quite awkward when put on one line. I'm pretty sure it could still be optimized somewhat. With formatting:

BEGIN {
  FS=OFS="\t";
  splitAt=3;
}{
  gsub(/ +/,OFS,$splitAt);
  line=$1;
  for(i=2;i<splitAt;i++)
    line=line""OFS""$i;
  line=line""OFS""$splitAt;
  for(i=splitAt+1;i<=NF;i++)
    line=line""OFS""$i;
  print line;
}

And on one line:

awk <inputFile 'BEGIN{FS=OFS="\t"; splitAt=3;} {gsub(/ +/,OFS,$splitAt); line=$1; for(i=2;i<splitAt;i++) line=line""OFS""$i; line=line""OFS""$splitAt; for(i=splitAt+1;i<=NF;i++) line=line""OFS""$i; print line ;}'

This could be refactored to take splitAt as a parameter to the script.
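One possible shape for that refactoring is to pass the column number with awk's -v option. This is a sketch rather than the answer's own script: it relies on the fact that modifying a field makes awk rebuild $0 with OFS, so a plain print is enough and the manual loops can be dropped. The sample row is invented, with a space in column 1 and an extra trailing column to show that both are left alone.

```shell
# Hypothetical row: space inside column 1, names in column 3, one more column after it.
row=$(printf 'id 1\trogoc232\tRoy  Don Mike\ttail')

# splitAt is supplied from the command line instead of being hard-coded.
split=$(printf '%s\n' "$row" |
  awk -v splitAt=3 'BEGIN{FS=OFS="\t"} {gsub(/ +/, OFS, $splitAt); print}')

printf '%s\n' "$split" | od -c
```

Changing -v splitAt=3 to another number splits that column instead, without editing the script body.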

2 Comments

Hi Krzysztof, your solution works well! But what if I needed to split on spaces not my 3rd column but my 35th column, and then print all the tab-delimited columns from 1 to 35, including the new columns that were nested (by spaces) in the 35th column? Is there a way to incorporate this into your command, rather than tediously typing at the end: ;print $1,$2,$3,$4,..etc..,$35}' ? Thanks a lot!
OK, the solution to my question is actually this: awk -F '\t' 'BEGIN{OFS="\t"} {gsub(/ +/,"\t",$35); for(i=1;i<=35;i++) printf "%s",$i "\t";printf "\n"}' aaa.txt > mine.txt
