I'll leave the output formatting to you, but here's how to build an array of the fields after handling embedded commas, escaped quotes, and unwanted spaces around some fields. You can then do whatever you want with them:
$ cat tst.awk
BEGIN { FPAT = "([^,]*)|(\"[^\"]+\")" }
{ sub(/#.*/,"") }
NF {
    # replace all escaped quotes with a newline and resplit the record
    gsub(/\\"/,RS)
    for (i=1;i<=NF;i++) {
        # restore the escaped quotes in this field
        gsub(RS,"\\\"",$i)
        f[i] = $i
    }
    for (i=1;i<=NF;i++) {
        # remove this to leave leading/trailing white space:
        gsub(/^[[:space:]]+|[[:space:]]+$/,"",f[i])
        # remove this to leave quotes around fields:
        gsub(/^"|"$/,"",f[i])
        print NR, NF, i, "<" f[i] ">"
    }
    print "----"
}
$ awk -f tst.awk file
4 5 1 <header1 with quotes>
4 5 2 <header2withoutquotes>
4 5 3 <header3>
4 5 4 <header4>
4 5 5 <hdeader5>
----
5 5 1 <Sample Text>
5 5 2 <2>
5 5 3 <3>
5 5 4 <4>
5 5 5 <MoreText, with commas>
----
6 5 1 <Text2 with escaped \">
6 5 2 <8>
6 5 3 <6>
6 5 4 <7>
6 5 5 <9>
----
7 5 1 <Text3>
7 5 2 <876>
7 5 3 <0.6>
7 5 4 <7>
7 5 5 <10>
----
The above uses GNU awk for FPAT; with other awks you'd need a while(match(...)) loop instead.
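In case it helps, here is a rough sketch of what such a while(match(...)) loop could look like in an awk without FPAT. The names csv_fields and csv_split are made up for this illustration, and it assumes every field is followed directly by a comma or the end of the line. It only covers the field-splitting part, so the \" swap from the script above would still have to be applied to the record first.

```shell
# Hypothetical helper: split each input line into comma-separated fields
# without FPAT, keeping "quoted, fields" together.
csv_fields() {
    awk '
    # csv_split() is an illustrative name, not a built-in.
    # Fills out[1..n] with the fields of rec and returns n.
    function csv_split(rec, out,    n, len) {
        split("", out)                      # clear results from earlier records
        n = 0
        rec = rec ","                       # sentinel so the last field ends in a comma
        while (rec != "") {
            if (match(rec, /^"[^"]+"/))     # quoted field, may contain commas
                len = RLENGTH
            else if (match(rec, /^[^,]+/))  # plain non-empty field
                len = RLENGTH
            else
                len = 0                     # empty field
            out[++n] = substr(rec, 1, len)
            rec = substr(rec, len + 2)      # skip the field and the comma after it
        }
        return n
    }
    $0 != "" {
        n = csv_split($0, f)
        for (i = 1; i <= n; i++)
            print NR, n, i, "<" f[i] ">"
    }'
}

printf '%s\n' 'a,"b, c",,d' | csv_fields
# 1 4 1 <a>
# 1 4 2 <"b, c">
# 1 4 3 <>
# 1 4 4 <d>
```

The quoted case is tested before the plain one so the loop never depends on how a particular awk resolves regexp alternation.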
See http://www.gnu.org/software/gawk/manual/gawk.html#Splitting-By-Content for how FPAT works to split the input into fields. Other than that:
- The first sub() and the test for NF discard comments and empty lines.
- The gsub() before the loop replaces every occurrence of \" with a newline so that escaped quotes aren't in the way of field splitting. Because this operation works on the whole record, awk re-splits the record afterwards, so FPAT is applied again at that point, ensuring that the original \"s have no effect on the fields going into the loop.
- The gsub() in the 1st loop restores any \"s that were originally present in the current field.
- The 1st gsub() in the 2nd loop just trims all leading and trailing white space off the current field.
- The 2nd [optional] gsub() in the 2nd loop removes start/end quotes from the field.
The rest should be obvious. I'm stripping leading/trailing spaces and quotes where f[] is used rather than where it's populated since you seem to want at least 2 different outputs, one with surrounding quotes and one without, but it's your choice where either of those gsub()s is done.
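If the swap-and-resplit step is the part that isn't obvious, this small sketch shows the same idea in isolation. To keep it runnable in any POSIX awk it uses a plain FS of "," and escaped commas (\,) instead of FPAT and escaped quotes, so treat it as an analogy rather than the script above. The name resplit_demo is made up for the example.

```shell
# Hypothetical demo: hide escaped commas behind RS, let awk re-split,
# then put a comma back inside each field.
resplit_demo() {
    awk -F, '
    {
        before = NF
        # Changing the whole record makes awk split it into fields again.
        # RS (a newline) cannot already occur inside a record, so it is
        # a safe temporary placeholder.
        gsub(/\\,/, RS)
        printf "NF before swap: %d, after: %d\n", before, NF
        for (i = 1; i <= NF; i++) {
            v = $i                # work on a copy, leave the field alone
            gsub(RS, ",", v)      # restore the comma inside the field
            print i, "<" v ">"
        }
    }'
}

printf '%s\n' 'one\,two,three' | resplit_demo
# NF before swap: 3, after: 2
# 1 <one,two>
# 2 <three>
```

The field count dropping from 3 to 2 is the re-split in action: once the escaped comma is hidden, the separator no longer sees it.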
To learn awk, get the book Effective Awk Programming, 4th Edition, by Arnold Robbins.