
Background - I want to extract specific columns from a csv file. The file is comma-delimited, uses double quotes as the text qualifier (optional, but present whenever a field contains special characters - see the example), uses backslashes as the escape character, and may contain blank fields.


Example Input and Desired Output - For example, I only want columns 1, 3, and 4 in the output file. The extracted columns should match the format of the original file exactly: no escape characters removed, no extra quotes added, and so on.

Input

"John \"Super\" Doe",25,"123 ABC Street",123-456-7890,"M",A
"Jane, Mary","",132 CBS Street,333-111-5332,"F",B
"Smith \"Jr.\", Jane",35,,555-876-1233,"F",
"Lee, Jack",22,123 Sesame St,"","M",D

Desired Output

"John \"Super\" Doe","123 ABC Street",123-456-7890
"Jane, Mary",132 CBS Street,333-111-5332
"Smith \"Jr.\", Jane",,555-876-1233
"Lee, Jack",123 Sesame St,""

Preliminary Script (awk) - The following is a preliminary script I found that works for the most part, but fails in at least one case I have noticed, and possibly others I have not seen or thought of yet.

#!/usr/xpg4/bin/awk -f

BEGIN{  OFS = FS = ","  }

/"/{
    for(i=1;i<=NF;i++){
        if($i ~ /^"[^"]+$/){
            for(x=i+1;x<=NF;x++){
                $i=$i","$x
                if($i ~ /"+$/){
                    z = x - (i + 1) + 1
                    for(y=i+1;y<=NF;y++)
                        $y = $(y + z)
                    break
                }
            }
            NF = NF - z
            i=x
        }
    }
print $1,$3,$4
}

The above seems to work well until it comes across a field that contains both escaped double quotes and a comma. In that case, the parsing will be off and the output will be incorrect.
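To make the failure concrete, here is a minimal Python sketch (Python used purely for illustration) showing what a naive comma split does to that line:

```python
# The input line the awk script trips over: one field contains
# both escaped quotes and a comma.
line = r'"Smith \"Jr.\", Jane",35,,555-876-1233,"F",'

# Splitting on every comma tears the quoted name field in two,
# which is essentially what FS="," does before the script's repair loop.
naive = line.split(',')
print(naive[0])  # "Smith \"Jr.\"
print(naive[1])  #  Jane"
```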


Question/Comments - I have read that awk is not the best option for parsing csv files and that Perl is suggested instead. However, I do not know Perl at all. I have found some examples of Perl scripts, but they do not give the output I am looking for, and I do not know how to edit them easily for what I want.

As for awk, I am familiar with it and use its basic functionality occasionally, but I do not know a lot of the advanced functionality, like some of the commands used in the script above. Is my desired output possible just by using awk? If so, would it be possible to edit the script above to fix the issue I am having with it? Could someone explain line by line what exactly the script is doing?

Any help would be appreciated, thanks!

  • The reason why Perl is suggested over awk is that the former can do look-ahead/look-behind assertions in order to discriminate a field separator from an internal field value. Commented Feb 15, 2012 at 4:25
  • @SiegeX - sorry, but you're way wrong. Perl is suggested over awk because there are 100% working, fully (or almost) debugged, stable, production-quality CSV parsing modules on CPAN, so you don't have to reinvent the wheel (poorly). Specifically, Text::CSV is usually considered a classic. Commented Feb 15, 2012 at 4:53
  • Is there a particular reason for the prohibition against the "extra quotes added" part? Also, do the quotes in the input file obey some 100% inviolate standard rule (e.g., "only quote fields that contain spaces, commas, or quotes")? Commented Feb 15, 2012 at 4:56
  • @DVK, No, no such rule. It's random whether quotes are used or not. Commented Feb 15, 2012 at 5:06
  • @DVK - No, there is no reason to prohibit extra quotes added as ikegami mentioned. I just mentioned that to emphasize I wanted the output file to be as closely formatted to the original as possible Commented Feb 15, 2012 at 17:46

7 Answers


I'm not going to reinvent the wheel.

use strict;
use warnings;

use Text::CSV_XS;

my $csv = Text::CSV_XS->new({
   binary      => 1,
   escape_char => '\\',
   eol         => "\n",
});

my $fh_in  = \*STDIN;
my $fh_out = \*STDOUT;

while (my $row = $csv->getline($fh_in)) {
   $csv->print($fh_out, [ @{$row}[0,2,3] ])
      or die("".$csv->error_diag());
}

$csv->eof()
   or die("".$csv->error_diag());

Output:

"John \"Super\" Doe","123 ABC Street",123-456-7890
"Jane, Mary","132 CBS Street",333-111-5332
"Smith \"Jr.\", Jane",,555-876-1233
"Lee, Jack","123 Sesame St",

It adds quotes around addresses that didn't have any, but since some addresses in your input already have quotes around them, whatever consumes this file can evidently handle quoted fields.


Reinventing the wheel:

my $field = qr/"(?:[^"\\]|\\.)*"|[^"\\,]*/s;
while (<>) {
   my @fields = /^($field),$field,($field),($field),/
      or die;
   print(join(',', @fields), "\n");
}

Output:

"John \"Super\" Doe","123 ABC Street",123-456-7890
"Jane, Mary",132 CBS Street,333-111-5332
"Smith \"Jr.\", Jane",,555-876-1233
"Lee, Jack",123 Sesame St,""

10 Comments

Thank you for your solutions. Unfortunately, I am not able to use your first solution since the machine I am using does not have the Text::CSV_XS module and I am not able to install it. The second (reinvented) solution works for what I need. However, the only problem is the part where it specifies which columns to print out. Is there a way to specify which columns similar to the first solution where you can just list the column numbers? Potentially, my csv file can have hundreds of columns and I need to be able to easily change which columns to parse out.
@yousir - you can use Text::CSV instead. It's pure Perl
@yousir, You did not say why you cannot install it, so we can neither help you install it nor find a workaround if we don't know what needs to be worked around.
@yousir, I didn't make it so you could pick other columns because that wasn't your question. But really, it's trivial to build the pattern dynamically to pick other columns.
@ikegami - Unless I was following the wrong instructions, I indeed do need additional privileges to install a module from CPAN. Regardless, I was able to find a workaround way to "install" Text::CSV as DVK suggested and utilize your first script to achieve what I wanted. I simply had to put CSV.pm and CSV_PP.pm from the Text::CSV source in a folder named "Text" in the working directory of the script.
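To illustrate ikegami's comment about building the pattern dynamically, here is a hedged Python sketch of the same idea (the `FIELD` regex is a port of the answer's Perl pattern; the function names and 0-based `cols` interface are my own invention, not from the answer):

```python
import re

# Port of the answer's field regex: a quoted field with backslash
# escapes, or a bare field with no quote/backslash/comma.
FIELD = r'"(?:[^"\\]|\\.)*"|[^"\\,]*'

def make_pattern(cols):
    """Build a regex that captures only the (0-based) columns in cols."""
    n = max(cols) + 1
    parts = ['({})'.format(FIELD) if i in cols else '(?:{})'.format(FIELD)
             for i in range(n)]
    return re.compile('^' + ','.join(parts) + '(?:,|$)')

def extract(line, cols=(0, 2, 3)):
    m = make_pattern(cols).match(line)
    return ','.join(m.groups())

line = r'"John \"Super\" Doe",25,"123 ABC Street",123-456-7890,"M",A'
print(extract(line))  # "John \"Super\" Doe","123 ABC Street",123-456-7890
```

Changing the selected columns is then just a matter of passing a different `cols` tuple.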

I'd suggest Python's csv module:

#!/usr/bin/env python3
import csv

# newline='' is the documented way to open files for the csv module in Python 3
with open('input.csv', newline='') as f_in, \
     open('output.csv', 'w', newline='') as f_out:
    rdr = csv.reader(f_in, escapechar='\\')
    wtr = csv.writer(f_out, escapechar='\\', doublequote=False)
    for row in rdr:
        wtr.writerow(row[0:1] + row[2:4])

output.csv

John \"Super\" Doe,123 ABC Street,123-456-7890
"Jane, Mary",132 CBS Street,333-111-5332
"Smith \"Jr.\", Jane",,555-876-1233
"Lee, Jack",123 Sesame St,

1 Comment

Removing double quotes where they exist is worse than adding some where they aren't any.

The following command extracts the required fields (e.g., the first, third, and fourth), using ',' as the delimiter, from sample.csv and displays the output on the console:

cut -f1,3,4 -d',' sample.csv

If you want to store the output in a new csv file, redirect the output to a file as below:

cut -f1,3,4 -d',' sample.csv > newSample.csv

Comments


Before I post, I see now that this is an old question bumped by an already-deleted answer; however, I thought I would still use the opportunity to show off Tie::Array::CSV, which makes CSV file manipulation as easy as working with Perl arrays. Full disclosure: I'm the author.

Anyway, here is the script. The OP's data required changing the escape character, and Perl indexes arrays starting at 0, but other than that this should be quite readable.

#!/usr/bin/env perl

use strict;
use warnings;

use Tie::Array::CSV;

my $opts = { text_csv => { escape_char => '\\' } };

tie my @input,  'Tie::Array::CSV', 'data', $opts or die "Cannot open file 'data': $!";
tie my @output, 'Tie::Array::CSV', 'out',  $opts or die "Cannot open file 'out': $!";

for my $row (@input) {
  my @slice = @{ $row }[0,2,3];
  push @output, \@slice;
}

That said, I think that last loop doesn't lose too much readability if I convert it to the (IMO) more impressive form:

push @output, [ @{$_}[0,2,3] ] for @input;

Comments


csvkit is a tool that handles csv files and allows such operations (among other features).

See csvcut. Its command-line interface is compact, and it handles the multitude of csv formats (tsv, other delimiters, encodings, escape chars, etc.)

What you asked for can be done using:

csvcut --columns 0,2,3 input.csv

Comments


GNU awk solution. Just using the wheel as a wheel. You can define what fields should look like using FPAT, like this:

$ awk -vFPAT='[^,]+|"[^"]*"' -vOFS=, '{print $1, $3, $4}' file

which results in:

"John \"Super\" Doe","123 ABC Street",123-456-7890
"Jane, Mary",132 CBS Street,333-111-5332
"Smith \"Jr.\",35,555-876-1233
"Lee, Jack",123 Sesame St,""

Explanation of the regex:

[^,]+           # 1 or more occurrences of anything that's not a comma, 
|               # OR
"[^"]*"         # 0 or more characters unequal to '"' enclosed by '"'

Read about FPAT in the gawk manual
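For readers without gawk, the following Python sketch (the helper name is my own) emulates FPAT's POSIX longest-match field extraction; it reproduces the behavior above, including the failure on fields that contain escaped quotes:

```python
import re

# The two alternatives of FPAT='[^,]+|"[^"]*"'
ALTS = [re.compile(r'[^,]+'), re.compile(r'"[^"]*"')]

def fpat_fields(line):
    """Split like gawk's FPAT: at each position take the LONGEST
    matching alternative (POSIX semantics); unmatched characters
    are treated as separators."""
    fields, pos = [], 0
    while pos < len(line):
        best = ''
        for alt in ALTS:
            m = alt.match(line, pos)
            if m and len(m.group()) > len(best):
                best = m.group()
        if best:
            fields.append(best)
            pos += len(best)
        else:
            pos += 1  # a separator comma
    return fields

row = fpat_fields('"Jane, Mary","",132 CBS Street,333-111-5332,"F",B')
print(row[0], row[2], row[3], sep=',')  # "Jane, Mary",132 CBS Street,333-111-5332
```

Note that, like FPAT='[^,]+|"[^"]*"' itself, this sketch silently skips empty fields, since neither alternative can match an empty string between two commas.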

Now, walking you through your script. Basically, it tries to rewrite what your fields look like. At first, awk splits on ",", which obviously causes some problems. Next, the script looks for fields that are not properly closed by a '"'.

BEGIN{OFS=FS =","}                        # set field sep (FS) and output field 
                                          #   sep to ,
/"/{                                      # for each line matching '"'
    for(i=1;i<=NF;i++){                   # loop through fields 1 to NF
        if($i ~ /^"[^"]+$/){              # IF field $i start with '"', followed by
                                          #   non-quotes
            for(x=i+1;x<=NF;x++){         # loop through ALL following fields
                $i=$i","$x                # concatenate field $i with ALL following 
                                          #   fields, separated by ","
                if($i ~ /"+$/){           # IF field $i ends with '"'
                    z = x - (i + 1) + 1   # z is index of field we're looking at next
                    for(y=i+1;y<=NF;y++)  
                        $y = $(y + z)     # change contents of following fields to 
                                          #   contents of field, z steps further
                                          #   down the line
                    break                 # break out of for(x) loop
                }
            }
            NF = NF - z                   # reset number of fields
            i=x                           # continue loop for(i) at index x
        }
    }
 print $1,$3,$4
}

Your script fails on this input line:

"Smith \"Jr.\", Jane",35,,555-876-1233,"F",

simply because $i ~ /^"[^"]+$/ fails on $1.
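You can verify that in isolation; here is a quick Python check (illustration only) of the same regex against the first comma-split chunk:

```python
import re

# After the initial FS="," split, $1 of the failing line is this chunk:
field1 = r'"Smith \"Jr.\"'

# The script's test /^"[^"]+$/ wants: an opening quote followed only by
# non-quote characters up to end of string.  The escaped quote leaves a
# literal '"' inside the chunk, so [^"]+ can never reach the end.
print(re.match(r'^"[^"]+$', field1))  # None
```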

I hope you agree with me that rewriting the fields like this can be tricky. More than that, it's like saying, "Oh, I like awk, but I'm going to use it like C/Perl/Python." Using FPAT is a shorter solution, to say the least.

Comments


I made some mistakes; hopefully they are corrected now.

awk '{sub(/y",""/,"y\42")sub(/,2.|,3./,"")sub(/,".",.*/,"")}1' file

"John \"Super\" Doe","123 ABC Street",123-456-7890
"Jane, Mary",132 CBS Street,333-111-5332
"Smith \"Jr.\", Jane",,555-876-1233
"Lee, Jack",123 Sesame St,""

1 Comment

Output of 2nd line doesn't comply with OP's.
