
Background - I want to extract specific columns from a csv file. The file is comma-delimited, uses double quotes as the text qualifier (optional, but present whenever a field contains special characters - see the example), uses backslashes as the escape character, and may contain blank fields.


Example Input and Desired Output - For example, I only want columns 1, 3, and 4 in the output file. The extracted columns should match the format of the original file exactly: no escape characters removed, no extra quotes added, and so on.

Input

"John \"Super\" Doe",25,"123 ABC Street",123-456-7890,"M",A
"Jane, Mary","",132 CBS Street,333-111-5332,"F",B
"Smith \"Jr.\", Jane",35,,555-876-1233,"F",
"Lee, Jack",22,123 Sesame St,"","M",D

Desired Output

"John \"Super\" Doe","123 ABC Street",123-456-7890
"Jane, Mary",132 CBS Street,333-111-5332
"Smith \"Jr.\", Jane",,555-876-1233
"Lee, Jack",123 Sesame St,""

Preliminary Script (awk) - The following is a preliminary script I found that works for the most part, but fails in at least one case I have noticed, and possibly others I have not seen or thought of yet.

#!/usr/xpg4/bin/awk -f

BEGIN{  OFS = FS = ","  }

/"/{
    for(i=1;i<=NF;i++){
        if($i ~ /^"[^"]+$/){
            for(x=i+1;x<=NF;x++){
                $i=$i","$x
                if($i ~ /"+$/){
                    z = x - (i + 1) + 1
                    for(y=i+1;y<=NF;y++)
                        $y = $(y + z)
                    break
                }
            }
            NF = NF - z
            i=x
        }
    }
print $1,$3,$4
}

The above seems to work well until it comes across a field that contains both escaped double quotes and a comma. In that case, the parsing will be off and the output will be incorrect.
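To make the failure concrete, here is a minimal Python sketch (Python used purely for illustration) showing what a naive comma split does to that line:

```python
# The input line the awk script trips over: one field contains
# both escaped quotes and a comma.
line = r'"Smith \"Jr.\", Jane",35,,555-876-1233,"F",'

# Splitting on every comma tears the quoted name field in two,
# which is essentially what FS="," does before the script's repair loop.
naive = line.split(',')
print(naive[0])  # "Smith \"Jr.\"
print(naive[1])  #  Jane"
```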


Question/Comments - I have read that awk is not the best option for parsing csv files and that Perl is suggested instead. However, I do not know Perl at all. I have found some examples of Perl scripts, but they do not give the output I am looking for, and I do not know how to edit them easily for what I want.

As for awk, I am familiar with it and use its basic functionality occasionally, but I do not know a lot of the advanced functionality, like some of the commands used in the script above. Is my desired output possible just by using awk? If so, would it be possible to edit the script above to fix the issue I am having with it? Could someone explain line by line what exactly the script is doing?

Any help would be appreciated, thanks!

  • The reason why Perl is suggested over awk is that the former can do look-ahead/look-behind assertions in order to discriminate a field separator from an internal field value. Commented Feb 15, 2012 at 4:25
  • @SiegeX - sorry, but you're way wrong. Perl is suggested over awk because there are 100% working, fully (or almost) debugged, stable, production-quality CSV parsing modules on CPAN, so you don't have to reinvent the wheel (poorly). Specifically, Text::CSV is usually considered a classic. Commented Feb 15, 2012 at 4:53
  • Is there a particular reason for the prohibition against the "extra quotes added" part? Also, do the quotes in the input file obey some 100% inviolate standard rule (e.g., "only quote fields that contain spaces, commas, or quotes")? Commented Feb 15, 2012 at 4:56
  • @DVK, No, no such rule. It's random whether quotes are used or not. Commented Feb 15, 2012 at 5:06
  • @DVK - No, there is no reason to prohibit extra quotes added as ikegami mentioned. I just mentioned that to emphasize I wanted the output file to be as closely formatted to the original as possible Commented Feb 15, 2012 at 17:46

7 Answers


I'm not going to reinvent the wheel.

use strict;
use warnings;

use Text::CSV_XS;

my $csv = Text::CSV_XS->new({
   binary      => 1,
   escape_char => '\\',
   eol         => "\n",
});

my $fh_in  = \*STDIN;
my $fh_out = \*STDOUT;

while (my $row = $csv->getline($fh_in)) {
   $csv->print($fh_out, [ @{$row}[0,2,3] ])
      or die("".$csv->error_diag());
}

$csv->eof()
   or die("".$csv->error_diag());

Output:

"John \"Super\" Doe","123 ABC Street",123-456-7890
"Jane, Mary","132 CBS Street",333-111-5332
"Smith \"Jr.\", Jane",,555-876-1233
"Lee, Jack","123 Sesame St",

It adds quotes around addresses that didn't have any, but since some addresses in your input already have quotes around them, whatever consumes this file can evidently handle quoted fields.


Reinventing the wheel:

my $field = qr/"(?:[^"\\]|\\.)*"|[^"\\,]*/s;
while (<>) {
   my @fields = /^($field),$field,($field),($field),/
      or die;
   print(join(',', @fields), "\n");
}

Output:

"John \"Super\" Doe","123 ABC Street",123-456-7890
"Jane, Mary",132 CBS Street,333-111-5332
"Smith \"Jr.\", Jane",,555-876-1233
"Lee, Jack",123 Sesame St,""

10 Comments

Thank you for your solutions. Unfortunately, I am not able to use your first solution since the machine I am using does not have the Text::CSV_XS module and I am not able to install it. The second (reinvented) solution works for what I need. However, the only problem is the part where it specifies which columns to print out. Is there a way to specify which columns similar to the first solution where you can just list the column numbers? Potentially, my csv file can have hundreds of columns and I need to be able to easily change which columns to parse out.
@yousir - you can use Text::CSV instead. It's pure Perl
@yousir, You did not say why you cannot install it, so we can neither help you install it nor find a workaround if we don't know what needs to be worked around.
@yousir, I didn't make it so you could pick other columns because that wasn't your question. But really, it's trivial to build the pattern dynamically to pick other columns.
@ikegami - Unless I was following the wrong instructions, I indeed do need additional privileges to install a module from CPAN. Regardless, I was able to find a workaround way to "install" Text::CSV as DVK suggested and utilize your first script to achieve what I wanted. I simply had to put CSV.pm and CSV_PP.pm from the Text::CSV source in a folder named "Text" in the working directory of the script.
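To illustrate ikegami's comment about building the pattern dynamically, here is a hedged Python sketch of the same idea (the `FIELD` regex is a port of the answer's Perl pattern; the function names and 0-based `cols` interface are my own invention, not from the answer):

```python
import re

# Port of the answer's field regex: a quoted field with backslash
# escapes, or a bare field with no quote/backslash/comma.
FIELD = r'"(?:[^"\\]|\\.)*"|[^"\\,]*'

def make_pattern(cols):
    """Build a regex that captures only the (0-based) columns in cols."""
    n = max(cols) + 1
    parts = ['({})'.format(FIELD) if i in cols else '(?:{})'.format(FIELD)
             for i in range(n)]
    return re.compile('^' + ','.join(parts) + '(?:,|$)')

def extract(line, cols=(0, 2, 3)):
    m = make_pattern(cols).match(line)
    return ','.join(m.groups())

line = r'"John \"Super\" Doe",25,"123 ABC Street",123-456-7890,"M",A'
print(extract(line))  # "John \"Super\" Doe","123 ABC Street",123-456-7890
```

Changing the selected columns is then just a matter of passing a different `cols` tuple.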

I'd suggest Python's csv module:

#!/usr/bin/env python3
import csv

# newline='' is the documented way to open files for the csv module in Python 3
with open('input.csv', newline='') as f_in, \
     open('output.csv', 'w', newline='') as f_out:
    rdr = csv.reader(f_in, escapechar='\\')
    wtr = csv.writer(f_out, escapechar='\\', doublequote=False)
    for row in rdr:
        wtr.writerow(row[0:1] + row[2:4])

output.csv

John \"Super\" Doe,123 ABC Street,123-456-7890
"Jane, Mary",132 CBS Street,333-111-5332
"Smith \"Jr.\", Jane",,555-876-1233
"Lee, Jack",123 Sesame St,

1 Comment

Removing double quotes where they exist is worse than adding some where they aren't any.

The following command extracts the required fields (e.g., the first, third, and fourth), using ',' as the delimiter, from sample.csv and displays the output on the console:

cut -f1,3,4 -d',' sample.csv

If you want to store the output in a new csv file, redirect the output to a file as below:

cut -f1,3,4 -d',' sample.csv > newSample.csv

Comments


Before I post, I see now that this is an old question bumped by an already-deleted answer; however, I thought I would still use the opportunity to show off Tie::Array::CSV, which makes CSV file manipulation as easy as working with Perl arrays. Full disclosure: I'm the author.

Anyway, here is the script. The OP's data required changing the escape character, and Perl indexes arrays starting at 0, but other than that this should be quite readable.

#!/usr/bin/env perl

use strict;
use warnings;

use Tie::Array::CSV;

my $opts = { text_csv => { escape_char => '\\' } };

tie my @input,  'Tie::Array::CSV', 'data', $opts or die "Cannot open file 'data': $!";
tie my @output, 'Tie::Array::CSV', 'out',  $opts or die "Cannot open file 'out': $!";

for my $row (@input) {
  my @slice = @{ $row }[0,2,3];
  push @output, \@slice;
}

That said, I think that last loop doesn't lose too much readability if I convert it to the (IMO) more impressive form:

push @output, [ @{$_}[0,2,3] ] for @input;

Comments


csvkit is a tool that handles csv files and allows such operations (among other features).

See csvcut. Its command-line interface is compact, and it handles the multitude of csv formats (tsv, other delimiters, encodings, escape chars, etc.)

What you asked for can be done using:

csvcut --columns 0,2,3 input.csv

Comments


GNU awk solution. Just using the wheel as a wheel. You can define what fields should look like using FPAT, like this:

$ awk -vFPAT='[^,]+|"[^"]*"' -vOFS=, '{print $1, $3, $4}' file

which results in:

"John \"Super\" Doe","123 ABC Street",123-456-7890
"Jane, Mary",132 CBS Street,333-111-5332
"Smith \"Jr.\",35,555-876-1233
"Lee, Jack",123 Sesame St,""

Explanation of the regex:

[^,]+           # 1 or more occurrences of anything that's not a comma, 
|               # OR
"[^"]*"         # 0 or more characters unequal to '"' enclosed by '"'

Read about FPAT in the gawk manual
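For readers without gawk, the following Python sketch (the helper name is my own) emulates FPAT's POSIX longest-match field extraction; it reproduces the behavior above, including the failure on fields that contain escaped quotes:

```python
import re

# The two alternatives of FPAT='[^,]+|"[^"]*"'
ALTS = [re.compile(r'[^,]+'), re.compile(r'"[^"]*"')]

def fpat_fields(line):
    """Split like gawk's FPAT: at each position take the LONGEST
    matching alternative (POSIX semantics); unmatched characters
    are treated as separators."""
    fields, pos = [], 0
    while pos < len(line):
        best = ''
        for alt in ALTS:
            m = alt.match(line, pos)
            if m and len(m.group()) > len(best):
                best = m.group()
        if best:
            fields.append(best)
            pos += len(best)
        else:
            pos += 1  # a separator comma
    return fields

row = fpat_fields('"Jane, Mary","",132 CBS Street,333-111-5332,"F",B')
print(row[0], row[2], row[3], sep=',')  # "Jane, Mary",132 CBS Street,333-111-5332
```

Note that, like FPAT='[^,]+|"[^"]*"' itself, this sketch silently skips empty fields, since neither alternative can match an empty string between two commas.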

Now, walking you through your script. Basically, it tries to rewrite what your fields look like. At first, awk splits on ",", which obviously causes some problems. Next, the script looks for fields that are not properly closed by a '"'.

BEGIN{OFS=FS =","}                        # set field sep (FS) and output field 
                                          #   sep to ,
/"/{                                      # for each line matching '"'
    for(i=1;i<=NF;i++){                   # loop through fields 1 to NF
        if($i ~ /^"[^"]+$/){              # IF field $i start with '"', followed by
                                          #   non-quotes
            for(x=i+1;x<=NF;x++){         # loop through ALL following fields
                $i=$i","$x                # concatenate field $i with ALL following 
                                          #   fields, separated by ","
                if($i ~ /"+$/){           # IF field $i ends with '"'
                    z = x - (i + 1) + 1   # z is index of field we're looking at next
                    for(y=i+1;y<=NF;y++)  
                        $y = $(y + z)     # change contents of following fields to 
                                          #   contents of field, z steps further
                                          #   down the line
                    break                 # break out of for(x) loop
                }
            }
            NF = NF - z                   # reset number of fields
            i=x                           # continue loop for(i) at index x
        }
    }
 print $1,$3,$4
}

Your script fails on this input line:

"Smith \"Jr.\", Jane",35,,555-876-1233,"F",

simply because $i ~ /^"[^"]+$/ fails on $1.
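You can verify that in isolation; here is a quick Python check (illustration only) of the same regex against the first comma-split chunk:

```python
import re

# After the initial FS="," split, $1 of the failing line is this chunk:
field1 = r'"Smith \"Jr.\"'

# The script's test /^"[^"]+$/ wants: an opening quote followed only by
# non-quote characters up to end of string.  The escaped quote leaves a
# literal '"' inside the chunk, so [^"]+ can never reach the end.
print(re.match(r'^"[^"]+$', field1))  # None
```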

I hope you agree with me that rewriting the fields like this can be tricky. More than that, it's like saying, "Oh, I like awk, but I'm going to use it like C/Perl/Python." Using FPAT is a shorter solution, to say the least.

Comments


I made some mistakes; hopefully they are corrected now.

awk '{sub(/y",""/,"y\42")sub(/,2.|,3./,"")sub(/,".",.*/,"")}1' file

"John \"Super\" Doe","123 ABC Street",123-456-7890
"Jane, Mary",132 CBS Street,333-111-5332
"Smith \"Jr.\", Jane",,555-876-1233
"Lee, Jack",123 Sesame St,""

1 Comment

Output of 2nd line doesn't comply with OP's.
