
I'm working on a CSV file.

The command I used for filtering:

awk -F"," '{print $14}' out_file.csv > test1.csv

This is an example of what my data looks like. I have around 43 rows and 12,000 columns. I planned to extract a single column using an awk command, but I am not able to extract column 3 (disease) alone.

I used the following command to get my output:

awk -F"," '{print $3}' out_file.csv > test1.csv

This is my file:

gender|gene_name  |disease         |1000g_oct2014|Polyphen |SNAP 
male  |RB1,GTF2A1L|cancer,diabetes |0.1          |0.46     |0.1  
male  |NONE,LOC441|diabetes        |0.003        |0.52     |0.6  
male  |TBC1D1     |diabetes        |0.940        |1        |0.9  
male  |BCOR       |cancer          |0            |0.31     |0.2  
male  |TP53       |diabetes        |0            |0.54     |0.4  

Note: I did not use "|" as a delimiter; it is only there to show the columns in order. My data looks exactly like this in the spreadsheet.


But I'm getting output in the following way:

Disease
GTF2A1L
LOC441
TBC1D1
BCOR
TP53

When I open the file in a spreadsheet I get the results in the proper manner, but when I use awk, the commas inside column 2 are also treated as field separators. I don't know why; can anyone help me with this?

  • In the image there are indeed pipes separating the fields. Commented Dec 22, 2015 at 10:52
  • Show us the output of head out_file.csv so we can see what's really being given to awk. Commented Dec 22, 2015 at 10:59
  • You said you didn't use | as FS, nor spaces (the default). You used , (comma) in your code, so I cannot see how you could get your output from your input. E.g. the first line has no comma at all, so why was Disease printed there? Commented Dec 22, 2015 at 11:04
  • I can see you have comma-separated values, where some of the values themselves have commas. Those values are delimited by quotes as appropriate for CSV format. Unfortunately awk isn't so great for dealing with files like that. If you're exporting from Excel, try using a different separator (tab or "|", e.g.) that you know won't occur in your values. That will make things much easier. Commented Dec 22, 2015 at 12:21
  • Please can you also add some sample data to your question in a text format. Commented Dec 22, 2015 at 12:23

3 Answers


The root of your problem is that you have comma-separated values with embedded commas.

That makes life more difficult. I would suggest using a CSV parser.

I quite like perl and Text::CSV:

#!/usr/bin/env perl
use strict;
use warnings;

use Text::CSV;

open ( my $data, '<', 'data_file.csv' ) or die $!; 

my $csv = Text::CSV -> new ( { binary => 1, sep_char => ',', eol => "\n" } );

while ( my $row = $csv -> getline ( $data ) ) {
   print $row -> [2],"\n"; 
}

Of course, I can't tell for sure whether that actually works, because the data you've linked on your Google Drive doesn't match the question you've asked. (Note: Perl arrays start at zero, so [2] is the 3rd field and [3] would be the 4th.)

But it should do the trick - Text::CSV handles quoted comma fields nicely.


1 Comment

I will try this and let you know.

Unfortunately the link you provided ("This is my file") points to two files, neither of which (at the time of this writing) seems to correspond with the sample you gave. However, if your file really is a CSV file with commas used both for separating fields and embedded within fields, then the advice given elsewhere to use a CSV-aware tool is very sound. (I would recommend considering a command-line program that can convert CSV to TSV so the entire *nix tool chain remains at your disposal.)
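
As a sketch of that conversion idea (the function name is mine, and I'm assuming the values contain no literal tabs or newlines), Python's standard csv module does the quoting-aware split that plain awk cannot:

```python
import csv
import io

def csv_to_tsv(text):
    """Convert quoted CSV text to TSV.
    Assumes field values contain no literal tabs or newlines."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        writer.writerow(row)
    return out.getvalue()

# Embedded commas survive, because the csv module honors the quoting:
print(csv_to_tsv('male,"RB1,GTF2A1L","cancer,diabetes"\n'))
```

After a conversion along these lines, awk -F'\t' (or cut -f3) extracts the disease column cleanly.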

Your sample output and attendant comments suggest you may already have a way to convert it to a pipe-delimited or tab-delimited file. If so, then awk can be used quite effectively. (If you have a choice, then I'd suggest tabs, since then programs such as cut are especially easy to use.)

The general idea, then, is to use awk with "|" (or tab) as the primary separator (awk -F"|" or awk -F\\t), and to use awk's split function to parse the contents of each top-level field.



In the end, this is what I did to get my answer in a simple way. Thanks to @peak, I found the solution.

First I used csvfilter, a Python module for filtering CSV files. I changed my delimiter with the following command:

csvfilter input_file.csv --out-delimiter="|" > out_file.csv

This command changes the delimiter ',' to '|'. Then I used awk to filter:

awk -F"|" 'FNR == 1 {print} {if ($14 < 0.01) print }' out_file.csv > filtered_file.csv
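
For comparison, the same keep-the-header-then-filter logic can be sketched in one Python step with the csv module (the helper name and sample data below are hypothetical; column 14 is index 13 zero-based):

```python
import csv
import io

def filter_rows(rows, column=13, threshold=0.01):
    """Keep the header row, plus every row whose given
    (zero-based) column parses as a number below threshold."""
    it = iter(rows)
    yield next(it)                        # header line, kept unconditionally
    for row in it:
        try:
            if float(row[column]) < threshold:
                yield row
        except (ValueError, IndexError):
            pass                          # skip malformed or non-numeric rows

# Tiny demo with a hypothetical 14-column file:
sample = "h1,h2,h3,h4,h5,h6,h7,h8,h9,h10,h11,h12,h13,val\n" \
         "a,b,c,d,e,f,g,h,i,j,k,l,m,0.003\n" \
         "a,b,c,d,e,f,g,h,i,j,k,l,m,0.940\n"
kept = list(filter_rows(csv.reader(io.StringIO(sample))))
print(kept)   # header plus the 0.003 row only
```

Because the csv module handles the quoted embedded commas itself, no delimiter conversion step is needed first.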

Thanks for your help.
