
I'm trying to process a CSV file using AWK, but I've run into a problem: many of the cells in my rows already contain commas, meaning I cannot separate the fields with awk -F,.

CSV FILE

Name,...DATE,COLUMNX,ADDRESSES
host1,...,NOV 24, 2022,['Element1', 'Element2'],"['192.168.x.99', 'fe80:XX','192.168.x.100', fe80:XX]"
host2,...,NOV 24, 2022,['Element3'],"['192.168.x.101', 'fe80:XX']"

The ... represents additional columns whose values contain [, ,, ', and ".

What I have tried:
awk -F, '{print $X}'
This gives me the following output:

'Element2']
"['192.168.x.101'

What I want to accomplish:

host1 192.168.x.99
host1 192.168.x.100
host2 192.168.x.101
  • How many columns does ,..., represent? Also, do any columns have ", [, or ' before the one containing the required IP address? Commented Nov 24, 2022 at 11:30
  • Thanks. I will post an approach that will work but you may have to vary the column number to get the part you need. Commented Nov 24, 2022 at 11:44
  • Your CSV is invalid: a field that contains commas should be quoted (see how the ADDRESSES fields are). Commented Nov 24, 2022 at 12:30
  • Do you REALLY have some fields that are inside double quotes, e.g. "['192.168.x.99', 'fe80:XX','192.168.x.100', fe80:XX]", and some that aren't, e.g. ['Element1', 'Element2']? Also, do you REALLY have some sub-fields that are inside single quotes, e.g. '192.168.x.100', and some that aren't, e.g. the last fe80:XX on the same line? If you do, then you should fix whatever is generating that mess; if you don't, then please fix your example to show more accurate sample input/output. Commented Nov 24, 2022 at 23:26

4 Answers


I'd recommend a proper CSV parser to do the job, then use awk to do the regex, e.g.

$ ruby -r 'csv' -ne 'lines=$_
  CSV.parse(lines) do |i| 
    i.each do |j| 
      printf("%s ", j)
    end
    puts ""
  end' file | 
awk '{gsub(/\[\047|\047\]|\047|\]|,/, "", $0)}
  /^host/{for(i=1;i<=NF;i++){if($i~/^[0-9]+\.+/){print $1, $i}}}'
host1 192.168.x.99
host1 192.168.x.100
host2 192.168.x.101

2 Comments

The OP's input isn't proper CSV by any "standard", so YMMV trying to use a proper CSV parser on it.
@EdMorton Ah, the file was updated. I used the old version which had correct quoting.

Using awk:

awk -F",\"|\"$" 'NR>1 { \
gsub(/\047|[\[\]]/,""); \
split($2,a,", "); \
split($1,h,","); \
for (n in a) {if (a[n] ~ /^[0-9]/) printf "%s %s\n", h[1], a[n]}}' src.csv

Output:

host1 192.168.x.100
host1 192.168.x.99
host2 192.168.x.101

Details:

-F",\"|\"$" (split on ," or " at end of record (will remove trailing double quote and each record will be split into two fields.

gsub(/\047|[\[\]]/,""); (sanitize by removing both single quotes and/or brackets)

split($2,a,", "); (split second field into array a on comma-space)

split($1,h,","); (split first field into array h on comma.

for (n in a) {if (a[n] ~ /^[0-9]/) printf "%s %s\n", h[1], a[n]} (iterate over array a and print the hostname and the item only when the item starts with a digit)
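
To see what that first split does to a single record, here is a quick check along these lines (run against the sample input exactly as shown in the question, assumed to be saved as src.csv; NR==3 is the host2 line). It should print something like:

awk -F",\"|\"$" 'NR==3 { printf "$1 = %s\n$2 = %s\n", $1, $2 }' src.csv

$1 = host2,...,NOV 24, 2022,['Element3']
$2 = ['192.168.x.101', 'fe80:XX']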

Comments


Modern versions of awk allow a record to be split on more than one field separator. Thus, each line can be split at both commas and single quote marks to isolate the data you need.

To use ' as a field separator along with , requires the former to be escaped, and it can be quite tricky to combine the two. The simplest way I've found after a few trials is to use the shell-level -F switch with a regular expression containing the escaped ' and ,. It's messy, as you have to close the first single quote before escaping the required one and re-opening the single-quoted command: -F'[,'\'']' (I generally prefer setting field separators within the awk procedure, but this one defeated me; see the note after the output below for one way around it).

This edited version works to isolate the field (change $35 to suit, by trial and error):

awk -F'[,'\'']' 'NR>1{print $1" "$35}' data.csv

I tested the above on the following test file:

data.csv:
Name,...DATE,COLUMNX,ADDRESSES
host1,['El3', 'El6'],['El7', 'El12'],['El1', 'El2'],['El', 'E12'],NOV 24, 2022,['Element1', 'Element2'],"['192.168.x.99', 'fe80:XX','192.168.x.100', fe80:XX]"
host2,['El3', 'El6'],['El7', 'El12'],['El1', 'El2'],['El', 'E12'],NOV 24, 2022,['Element1', 'Element2'],"['192.168.xxx.yy', 'fe80:XX','192.168.x.100', fe80:XX]"
host3,['El3', 'El6'],['El7', 'El12'],['El1', 'El2'],['El', 'E12'],NOV 24, 2022,['Element1', 'Element2'],"['192.xxx.x.99', 'fe80:XX','192.168.x.100', fe80:XX]"
host4,['El3', 'El6'],['El7', 'El12'],['El1', 'El2'],['El', 'E12'],NOV 24, 2022,['Element1', 'Element2'],"['xxx.168.x.99', 'fe80:XX','192.168.x.100', fe80:XX]"

output:

host1 192.168.x.99
host2 192.168.xxx.yy
host3 192.xxx.x.99
host4 xxx.168.x.99
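
As an aside, the same splitting can be done with FS set inside the script by using the octal escape \047 for the single quote, which avoids the shell-quoting dance described above. A minimal sketch (not separately tested, but it should behave the same way, since \047 denotes the same single-quote character and FS ends up as the same character class):

awk 'BEGIN{FS="[,\047]"} NR>1{print $1" "$35}' data.csv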

12 Comments

Apologies, I inadvertently used $0 instead of $1 in the original post; edited to correct.
Should the ' inside the FS statement be escaped?
No, but I just realised the OR | is probably not needed, as the square brackets already mean OR. And the quotes should probably be regex boundaries //. Really sorry, rushing and don't have time to make a test file, but I think setting two delimiters will get what you need.
Unfortunately @DavePritlove, I'm just as close as with my original try: awk -F, '{print $X}'
@EdMorton thanks, I like both of those.

Using GNU awk for FPAT:

$ cat tst.awk
BEGIN {
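    # a field is a run of non-comma characters, a "..."-quoted string, or a [...]-bracketed list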
    FPAT = "([^,]*)|(\"[^\"]+\")|([[][^]]+])"
}
NR > 1 {
    n = split($NF,a,/\047/)
    for ( i=2; i<=n; i+=4 ) {
        print $1, a[i]
    }
}

$ awk -f tst.awk file
host1 192.168.x.99
host1 192.168.x.100
host2 192.168.x.101

To see how FPAT is splitting the input into comma-separated fields and then split() is splitting the last field into '-separated subfields, add some prints, e.g.:

$ cat tst.awk
BEGIN {
    FPAT = "([^,]*)|(\"[^\"]+\")|([[][^]]+])"
}
NR > 1 {
    print "============="
    print
    for ( i=1; i<=NF; i++ ) {
        print i, "<" $i ">"
    }
    n = split($NF,a,/\047/)
    for ( i=1; i<=n; i++ ) {
        print "\t" NF "." i, "<" a[i] ">"
    }
    for ( i=2; i<=n; i+=4 ) {
        print $1, a[i]
    }
}

$ awk -f tst.awk file
=============
host1,...,NOV 24, 2022,['Element1', 'Element2'],"['192.168.x.99', 'fe80:XX','192.168.x.100', fe80:XX]"
1 <host1>
2 <...>
3 <NOV 24>
4 < 2022>
5 <['Element1', 'Element2']>
6 <"['192.168.x.99', 'fe80:XX','192.168.x.100', fe80:XX]">
        6.1 <"[>
        6.2 <192.168.x.99>
        6.3 <, >
        6.4 <fe80:XX>
        6.5 <,>
        6.6 <192.168.x.100>
        6.7 <, fe80:XX]">
host1 192.168.x.99
host1 192.168.x.100
=============
host2,...,NOV 24, 2022,['Element3'],"['192.168.x.101', 'fe80:XX']"
1 <host2>
2 <...>
3 <NOV 24>
4 < 2022>
5 <['Element3']>
6 <"['192.168.x.101', 'fe80:XX']">
        6.1 <"[>
        6.2 <192.168.x.101>
        6.3 <, >
        6.4 <fe80:XX>
        6.5 <]">
host2 192.168.x.101

See "What's the most robust way to efficiently parse CSV using awk?" for more information on FPAT and parsing CSVs with awk.

Comments
