
I'm trying to process a CSV file using AWK, but I've run into a problem: many of the cells in my rows already contain commas, meaning I cannot separate the fields with awk -F,.

CSV FILE

Name,...DATE,COLUMNX,ADDRESSES
host1,...,NOV 24, 2022,['Element1', 'Element2'],"['192.168.x.99', 'fe80:XX','192.168.x.100', fe80:XX]"
host2,...,NOV 24, 2022,['Element3'],"['192.168.x.101', 'fe80:XX']"

The ... represents additional columns whose values contain [, ,, ', and ".

What I have tried:
awk -F, '{print $X}'
This gives me the following output:

'Element2']
"['192.168.x.101'

What I want to accomplish:

host1 192.168.x.99
host1 192.168.x.100
host2 192.168.x.101
  • How many columns does ,..., represent? Also, do any columns have ", [, or ' before the one containing the required IP address? Commented Nov 24, 2022 at 11:30
  • Thanks. I will post an approach that will work but you may have to vary the column number to get the part you need. Commented Nov 24, 2022 at 11:44
  • Your CSV is invalid: a field that contains commas should be quoted (see how the ADDRESSES fields are). Commented Nov 24, 2022 at 12:30
  • Do you REALLY have some fields that are inside double quotes, e.g. "['192.168.x.99', 'fe80:XX','192.168.x.100', fe80:XX]", and some that aren't, e.g. ['Element1', 'Element2']? Also, do you REALLY have some sub-fields that are inside single quotes, e.g. '192.168.x.100', and some that aren't, e.g. the last fe80:XX on the same line? If you do, then you should fix whatever is generating that mess; if you don't, then please fix your example to show more accurate sample input/output. Commented Nov 24, 2022 at 23:26

4 Answers


I'd recommend a proper CSV parser to do the job, then use awk to do the regex, e.g.

$ ruby -r 'csv' -ne 'lines=$_
  CSV.parse(lines) do |i| 
    i.each do |j| 
      printf("%s ", j)
    end
    puts ""
  end' file | 
awk '{gsub(/\[\047|\047\]|\047|\]|,/, "", $0)}
  /^host/{for(i=1;i<=NF;i++){if($i~/^[0-9]+\.+/){print $1, $i}}}'
host1 192.168.x.99
host1 192.168.x.100
host2 192.168.x.101

2 Comments

The OP's input isn't proper CSV by any "standard", so YMMV trying to use a proper CSV parser on it.
@EdMorton Ah, the file was updated. I used the old version which had correct quoting.

Using awk:

awk -F",\"|\"$" 'NR>1 { \
gsub(/\047|[\[\]]/,""); \
split($2,a,", "); \
split($1,h,","); \
for (n in a) {if (a[n] ~ /^[0-9]/) printf "%s %s\n", h[1], a[n]}}' src.csv

Output:

host1 192.168.x.100
host1 192.168.x.99
host2 192.168.x.101

Details:

-F",\"|\"$" (split on ," or " at end of record (will remove trailing double quote and each record will be split into two fields.

gsub(/\047|[\[\]]/,""); (sanitize by removing both single quotes and/or brackets)

split($2,a,", "); (split second field into array a on comma-space)

split($1,h,","); (split first field into array h on comma.

for (n in a) {if (a[n] ~ /^[0-9]/) printf "%s %s\n", h[1], a[n]} (iterate over array a and print the hostname and the item only when the item starts with a digit)
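
To see what that first split does to a single record, here is a quick check along these lines (run against the sample input exactly as shown in the question, assumed to be saved as src.csv; NR==3 is the host2 line). It should print something like:

awk -F",\"|\"$" 'NR==3 { printf "$1 = %s\n$2 = %s\n", $1, $2 }' src.csv

$1 = host2,...,NOV 24, 2022,['Element3']
$2 = ['192.168.x.101', 'fe80:XX']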

Comments


Modern versions of awk allow a record to be split on more than one field separator. Thus, each line can be split at both commas and single quote marks to isolate the data you need.

To use ' as a field separator along with , requires the former to be escaped, and it can be quite tricky to combine the two. The simplest way I've found after a few trials is to use the shell-level -F switch with a regular expression containing the escaped ' and ,. It's messy, as you have to close the first single quote before escaping the required one and re-opening the single-quoted command: -F'[,'\'']' (I generally prefer setting field separators within the awk procedure, but this one defeated me; see the note after the output below for one way around it).

This edited version works to isolate the field (change $35 to suit, by trial and error):

awk -F'[,'\'']' 'NR>1{print $1" "$35}' data.csv

I tested the above on the following test file:

data.csv:
Name,...DATE,COLUMNX,ADDRESSES
host1,['El3', 'El6'],['El7', 'El12'],['El1', 'El2'],['El', 'E12'],NOV 24, 2022,['Element1', 'Element2'],"['192.168.x.99', 'fe80:XX','192.168.x.100', fe80:XX]"
host2,['El3', 'El6'],['El7', 'El12'],['El1', 'El2'],['El', 'E12'],NOV 24, 2022,['Element1', 'Element2'],"['192.168.xxx.yy', 'fe80:XX','192.168.x.100', fe80:XX]"
host3,['El3', 'El6'],['El7', 'El12'],['El1', 'El2'],['El', 'E12'],NOV 24, 2022,['Element1', 'Element2'],"['192.xxx.x.99', 'fe80:XX','192.168.x.100', fe80:XX]"
host4,['El3', 'El6'],['El7', 'El12'],['El1', 'El2'],['El', 'E12'],NOV 24, 2022,['Element1', 'Element2'],"['xxx.168.x.99', 'fe80:XX','192.168.x.100', fe80:XX]"

output:

host1 192.168.x.99
host2 192.168.xxx.yy
host3 192.xxx.x.99
host4 xxx.168.x.99
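
As an aside, the same splitting can be done with FS set inside the script by using the octal escape \047 for the single quote, which avoids the shell-quoting dance described above. A minimal sketch (not separately tested, but it should behave the same way, since \047 denotes the same single-quote character and FS ends up as the same character class):

awk 'BEGIN{FS="[,\047]"} NR>1{print $1" "$35}' data.csv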

12 Comments

Apologies, I inadvertently used $0 instead of $1 in the original post; edited to correct.
Should the ' inside the FS statement be escaped?
No, but I just realised the OR | is probably not needed, as the square brackets already mean OR. And the quotes should probably be regex boundaries //. Really sorry, rushing and don't have time to make a test file, but I think setting two delimiters will get what you need.
Unfortunately @DavePritlove, I'm just as close as with my original try: awk -F, '{print $X}'
@EdMorton thanks, I like both of those.

Using GNU awk for FPAT:

$ cat tst.awk
BEGIN {
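    # a field is a run of non-comma characters, a "..."-quoted string, or a [...]-bracketed list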
    FPAT = "([^,]*)|(\"[^\"]+\")|([[][^]]+])"
}
NR > 1 {
    n = split($NF,a,/\047/)
    for ( i=2; i<=n; i+=4 ) {
        print $1, a[i]
    }
}

$ awk -f tst.awk file
host1 192.168.x.99
host1 192.168.x.100
host2 192.168.x.101

To see how FPAT is splitting the input into comma-separated fields and then split() is splitting the last field into '-separated subfields, add some prints, e.g.:

$ cat tst.awk
BEGIN {
    FPAT = "([^,]*)|(\"[^\"]+\")|([[][^]]+])"
}
NR > 1 {
    print "============="
    print
    for ( i=1; i<=NF; i++ ) {
        print i, "<" $i ">"
    }
    n = split($NF,a,/\047/)
    for ( i=1; i<=n; i++ ) {
        print "\t" NF "." i, "<" a[i] ">"
    }
    for ( i=2; i<=n; i+=4 ) {
        print $1, a[i]
    }
}

$ awk -f tst.awk file
=============
host1,...,NOV 24, 2022,['Element1', 'Element2'],"['192.168.x.99', 'fe80:XX','192.168.x.100', fe80:XX]"
1 <host1>
2 <...>
3 <NOV 24>
4 < 2022>
5 <['Element1', 'Element2']>
6 <"['192.168.x.99', 'fe80:XX','192.168.x.100', fe80:XX]">
        6.1 <"[>
        6.2 <192.168.x.99>
        6.3 <, >
        6.4 <fe80:XX>
        6.5 <,>
        6.6 <192.168.x.100>
        6.7 <, fe80:XX]">
host1 192.168.x.99
host1 192.168.x.100
=============
host2,...,NOV 24, 2022,['Element3'],"['192.168.x.101', 'fe80:XX']"
1 <host2>
2 <...>
3 <NOV 24>
4 < 2022>
5 <['Element3']>
6 <"['192.168.x.101', 'fe80:XX']">
        6.1 <"[>
        6.2 <192.168.x.101>
        6.3 <, >
        6.4 <fe80:XX>
        6.5 <]">
host2 192.168.x.101

See "What's the most robust way to efficiently parse CSV using awk?" for more information on FPAT and parsing CSVs with awk.

Comments
