
I have to parse a CSV file and dump its contents into MySQL tables.

# myfile.csv

# Contents
# Sample Headers

"header1 with quotes", header2withoutquotes, "header3", header4, hdeader5
"Sample Text",2,3,4,"MoreText, with commas"
"Text2 with escaped \"",8,6,7,9
"Text3",876,0.6,7,10

First output

1|header1 with quotes|Sample Text|myfile
1|header2withoutquotes|2|myfile
1|header3|3|myfile
1|header4|4|myfile
1|header5|MoreText, with commas|myfile

2|header1 with quotes|Text2 with escaped \"|myfile
2|header2withoutquotes|8|myfile
2|header3|6|myfile
2|header4|7|myfile
2|header5|9|myfile

3|header1 with quotes|Text3|myfile
3|header2withoutquotes|876|myfile
3|header3|0.6|myfile
3|header4|7|myfile
3|header5|10|myfile

In the second output, I will need custom headers to be aligned horizontally. For example:

rowid|"header1 with quotes"|"header3"|header4|filename 
1|Sample Text|3|4|myfile
2|Text2 with escaped \"|6|7|myfile
3|Text3|0.6|7|myfile

For the second output, it can be any set of headers that I choose. I can then load both of these outputs into MySQL tables using LOAD DATA INFILE. I'm looking for awk scripts to achieve this. Let me know if you need anything else. Tx.

  • Pardon the unformatted data. Still learning.. Commented Feb 2, 2016 at 6:19
  • @edmorton edited the question as requested. Commented Feb 3, 2016 at 7:06
  • If you need to deal with the full complexity of CSV with embedded commas and quotes, you are probably best off using Python or Perl and the CSV modules available with them, or a specialist tool like CSVfix (which was hosted on Google Code at one time, but that's now shut up shop; I'm not sure of the official source for it these days, which is embarrassing). Commented Feb 3, 2016 at 7:51

2 Answers


This should work:

{
    if(NR==1)
        split($0,header,",")
    else
    {
        split($0,line,",")
        for (i in line)  
        {
            gsub(/^[ \t]+|"|[ \t]+$)/, "", header[i]); 
            gsub(/^[ \t]+|"|[ \t]+$)/, "", line[i]); 
            print header[i]"|"line[i]"|"FILENAME
        }
        print ""
    }
}

Basically, it stores the first line in the header array, then splits each subsequent line into the line array, stripping quotes and trimming away leading and trailing spaces or tabs. Finally, it composes the output string.
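
If the script is saved as, say, parse.awk (the name is just a placeholder), it can be run against the sample file like this:

awk -f parse.awk myfile.csv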

Output:

header1|text1|file2
header2|2|file2
header3|3|file2
header4|4|file2
hdeader5|moretext|file2

header1|text2|file2
header2|8|file2
header3|6|file2
header4|7|file2
hdeader5|9|file2

header1|text3|file2
header2|876|file2
header3|0.6|file2
header4|7|file2
hdeader5|10|file2

You can get rid of the newlines between each block by removing the last print "" statement.
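
If you also want the rowid and filename columns from the question's first output, a minimal variation along the same lines could be (a sketch only: it assumes the first line of the file is the header row, that there are no comment or blank lines, and that the .csv extension should be dropped from FILENAME; it has the same limitation with embedded commas):

NR == 1 { split($0, header, ","); next }    # first line holds the headers
{
    n = split($0, line, ",")
    fname = FILENAME
    sub(/\.csv$/, "", fname)                # "myfile.csv" -> "myfile" (an assumption)
    for (i = 1; i <= n; i++) {
        # strip quotes and trim leading/trailing spaces and tabs, as above
        gsub(/^[ \t]+|"|[ \t]+$/, "", header[i])
        gsub(/^[ \t]+|"|[ \t]+$/, "", line[i])
        print NR-1 "|" header[i] "|" line[i] "|" fname
    }
    print ""
}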


12 Comments

Tx @Cynical. I forgot to mention that commas can appear inside the text too. Say for example: "Some, more text with,commas in between". These will be an issue, right? I just ran this script on my CSV files and realised my mistake. Any suggestions?
Sample: "Some text",Text,"Some text, with comma","Normal text in quotes". Tx again.
Yes, that could be an issue... Is it possible that text with commas is always enclosed in quotes?
Preprocess: change your field separator from commas to something not contained in your fields: tabs, tildes, pipes...
Yes, the ones with commas within text will always be within quotes. The ones without commas may or may not be within quotes. So you may have words like: Text and "Text" and "Text, with comma" all in the same file.

I'll leave the output formatting to you, but here's how to create an array of the fields after dealing with embedded commas, escaped quotes, and undesirable spaces surrounding some fields, so you can then do whatever you want with them:

$ cat tst.awk
BEGIN { FPAT = "([^,]*)|(\"[^\"]+\")" }
{ sub(/#.*/,"") }
NF {
    # replace all escaped quotes with a newline and resplit the record
    gsub(/\\"/,RS)

    for (i=1;i<=NF;i++) {
        # restore the escaped quotes in this field
        gsub(RS,"\\\"",$i)

        f[i] = $i
    }

    for (i=1;i<=NF;i++) {
        # remove this to leave leading/trailing white space:
        gsub(/^[[:space:]]+|[[:space:]]+$/,"",f[i])

        # remove this to leave quotes around fields:
        gsub(/^"|"$/,"",f[i])

        print NR, NF, i, "<" f[i] ">"
    }
    print "----"
}


$ awk -f tst.awk file
4 5 1 <header1 with quotes>
4 5 2 <header2withoutquotes>
4 5 3 <header3>
4 5 4 <header4>
4 5 5 <hdeader5>
----
5 5 1 <Sample Text>
5 5 2 <2>
5 5 3 <3>
5 5 4 <4>
5 5 5 <MoreText, with commas>
----
6 5 1 <Text2 with escaped \">
6 5 2 <8>
6 5 3 <6>
6 5 4 <7>
6 5 5 <9>
----
7 5 1 <Text3>
7 5 2 <876>
7 5 3 <0.6>
7 5 4 <7>
7 5 5 <10>
----

The above uses GNU awk for FPAT; with other awks you'd need a while(match(...)) loop.
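
For reference, here's a minimal sketch of that alternative (one possible way to write it, not from the answer itself); it mirrors the FPAT pattern above and does the same escaped-quote substitution before splitting:

# sketch: field extraction without FPAT, using a while(match(...)) loop
function splitcsv(str, arr,    n) {
    str = str ","                            # ensure every field ends with a comma
    n = 0
    while (length(str) > 0) {
        if (!match(str, /^"[^"]+"/))         # try a quoted field first...
            match(str, /^[^,]*/)             # ...otherwise take a run of non-commas
        arr[++n] = substr(str, 1, RLENGTH)
        str = substr(str, RLENGTH + 2)       # drop the field and its trailing comma
    }
    return n
}
{ sub(/#.*/, "") }                           # discard comments
/[^[:space:]]/ {                             # skip blank lines
    gsub(/\\"/, RS)                          # hide escaped quotes, as above
    nf = splitcsv($0, f)
    for (i = 1; i <= nf; i++) {
        gsub(RS, "\\\"", f[i])               # restore escaped quotes
        gsub(/^[[:space:]]+|[[:space:]]+$/, "", f[i])
        gsub(/^"|"$/, "", f[i])
        print NR, nf, i, "<" f[i] ">"
    }
    print "----"
}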

See http://www.gnu.org/software/gawk/manual/gawk.html#Splitting-By-Content for how FPAT works to split the input into fields. Other than that:

  1. The first sub() and test for NF discard comments and empty lines.
  2. The gsub() before the loop replaces every occurrence of \" with a newline so that escaped quotes don't get in the way of field splitting. Because this operation modifies the whole record, awk re-splits it afterwards, so FPAT is applied again at that point and the original \"s have no effect on the fields going into the loop.
  3. The gsub() in the 1st loop restores any \"s that were originally present in the current field.
  4. The 1st gsub() in the 2nd loop just trims all leading and trailing white space off the current field.
  5. The 2nd [optional] gsub() in the 2nd loop removes start/end quotes from the field.

and the rest should be obvious. I'm stripping leading/trailing spaces and quotes where f[] is used rather than where it's populated since you seem to want at least 2 different outputs, one with surrounding quotes and one without, but it's your choice where either of those gsub()s is done.
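
Building on the above, here is one possible way (a sketch only, not part of the answer; the cols variable, the got_headers flag, and the filename handling are all assumptions) to produce something like the second output from the question, restricted to a chosen set of headers:

# sketch: wide (second) output for a chosen subset of headers, reusing the
# FPAT-based splitting and cleanup from tst.awk above. GNU awk assumed.
BEGIN {
    FPAT = "([^,]*)|(\"[^\"]+\")"
    OFS = "|"
    ncols = split(cols, want, ",")      # e.g. -v cols='header1 with quotes,header3,header4'
}
{ sub(/#.*/, "") }                      # discard comments, as above
NF {
    gsub(/\\"/, RS)                     # hide escaped quotes, resplit the record
    for (i = 1; i <= NF; i++) {
        f[i] = $i
        gsub(RS, "\\\"", f[i])          # restore escaped quotes
        gsub(/^[[:space:]]+|[[:space:]]+$/, "", f[i])
        gsub(/^"|"$/, "", f[i])
    }
    if (!got_headers) {                 # first data record is the header row
        for (i = 1; i <= NF; i++) pos[f[i]] = i
        hdr = "rowid"
        for (j = 1; j <= ncols; j++) hdr = hdr OFS want[j]
        print hdr OFS "filename"
        got_headers = 1
        next
    }
    fname = FILENAME
    sub(/\.[^.]*$/, "", fname)          # "myfile.csv" -> "myfile" (an assumption)
    out = ++rowid
    for (j = 1; j <= ncols; j++) out = out OFS f[pos[want[j]]]
    print out OFS fname
}

It could be run as something like awk -v cols='header1 with quotes,header3,header4' -f wide.awk myfile.csv, where wide.awk is just a placeholder name.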

To learn awk, get the book Effective Awk Programming, 4th Edition, by Arnold Robbins.

1 Comment

Tx @edmorton. Appreciate your efforts.
