Using strings from several input files as search criteria for select columns in a CSV file using AWK

Question

Nature of the Problem:

I have a CSV file with 10 columns, of which 4 columns specify codes for diseases. Let us say that these are columns 1 - 4. I have 2 text files that contain "inclusion" and "exclusion" codes.

The inclusion file contains n input strings, one per line.

Example:

The exclusion file likewise contains m input strings, one per line.

Example:

A truncated version of the CSV file would look like the following:

D1,D2,D3,D4,A,B,C,D,E,F
123,00,145,567,A1,B1,C1,D1,E1,F1
890,001,456,0009,A2,B2,C2,D2,E2,F2
12301,456,00,145,A3,B3,C3,D3,E3,F3
567,1250,010,321,A4,B4,C4,D4,E4,F4

Using AWK, how can I take the two files, inclusion and exclusion, together with the CSV file and return the following?

D1,D2,D3,D4,A,B,C,D,E,F
123,00,145,567,A1,B1,C1,D1,E1,F1
567,1250,010,321,A4,B4,C4,D4,E4,F4

The CSV file can have millions of lines, while the inclusion and exclusion files can have dozens of lines. This is not a homework assignment, and I appreciate the help.

What happens if a line has both a field that matches inclusion and one that matches exclusion? What happens if it has neither? What have you tried so far?
– John1024, Commented Jul 7, 2015 at 2:16
Exclusion takes precedence. That's why the 3rd line is left out. Sorry for not making that clear.
– oort, Commented Jul 7, 2015 at 2:21
If it doesn't match, then the line is excluded. Up until this point I have been doing this by hardcoding specific strings into an awk line.
– oort, Commented Jul 7, 2015 at 2:34

John1024 · Accepted Answer · 2015-07-07 17:54:51Z

Using grep

$ head -n1 <file; grep -E "(^|,)($(tr '\n' '|' <inclusion))(,|$)" file | grep -Ev "(^|,)($(tr '\n' '|' <exclusion))(,|$)"
D1,D2,D3,D4,A,B,C,D,E,F
123,00,145,567,A1,B1,C1,D1,E1,F1
567,1250,010,321,A4,B4,C4,D4,E4,F4

Using awk

$ awk -v inc="(^|,)($(tr '\n' '|' <inclusion))(,|$)" -v exc="(^|,)($(tr '\n' '|' <exclusion))(,|$)" 'NR==1 || ($0 ~ inc && ! ($0 ~ exc))' file
D1,D2,D3,D4,A,B,C,D,E,F
123,00,145,567,A1,B1,C1,D1,E1,F1
567,1250,010,321,A4,B4,C4,D4,E4,F4

How it works

For both the grep and awk solutions, the key step is building a regular expression that matches any code from the inclusion file or from the exclusion file when it appears as a complete comma-delimited field. Because it is shorter, let's take exclusion as an example. We can create a regex for it as follows:

$ echo "(^|,)($(tr '\n' '|' <exclusion))(,|$)"
(^|,)(456|457|458|459|)(,|$)
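
Note the trailing | in the output above: tr converts the file's final newline too, which leaves an empty alternative at the end of the group. POSIX leaves an empty alternative in an extended regular expression undefined, so some implementations may reject it, and those that accept it will also match an empty field. A minimal sketch of one way to avoid this, assuming paste is available and the file has one code per line with no blank lines (the scratch directory and variable names are only for illustration):

```shell
# Work in a scratch directory so no real files are touched (illustrative setup).
tmp=$(mktemp -d)
printf '%s\n' 456 457 458 459 > "$tmp/exclusion"

# tr turns every newline into '|', including the last one,
# which leaves an empty alternative at the end of the group.
with_tr="(^|,)($(tr '\n' '|' < "$tmp/exclusion"))(,|$)"

# paste -s joins the lines and puts '|' only between them.
with_paste="(^|,)($(paste -sd'|' "$tmp/exclusion"))(,|$)"

echo "$with_tr"
echo "$with_paste"
rm -r "$tmp"
```

The second pattern is the same regex without the trailing |, and it can be passed to grep -E or to awk -v in the same way as the original.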

The regex for inclusion works analogously. Once the include and exclude regexes have been created, we can use them either with grep or with awk. If using awk, we use the condition:

NR==1 || ($0 ~ inc && ! ($0 ~ exc))

If this condition is true, awk performs its default action, which is to print the line. The condition is true if (1) we are on the first line, NR==1, or (2) the line matches the regex for inclusion, inc, and does not match the regex for exclusion, exc.
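
To see how the condition plays out, the following sketch prints, for each line of the sample CSV, whether inc matches, whether exc matches, and whether the line would be kept. The inclusion codes 145 and 567 are an assumption made for illustration, since the question does not list them; the exclusion codes are the ones shown in the regex above.

```shell
tmp=$(mktemp -d)
cat > "$tmp/file" <<'EOF'
D1,D2,D3,D4,A,B,C,D,E,F
123,00,145,567,A1,B1,C1,D1,E1,F1
890,001,456,0009,A2,B2,C2,D2,E2,F2
12301,456,00,145,A3,B3,C3,D3,E3,F3
567,1250,010,321,A4,B4,C4,D4,E4,F4
EOF

# Assumed inclusion codes (145, 567); exclusion codes taken from the answer.
inc='(^|,)(145|567)(,|$)'
exc='(^|,)(456|457|458|459)(,|$)'

# Columns printed: line number, inc matched, exc matched, line kept.
trace=$(awk -v inc="$inc" -v exc="$exc" '
{
    i = ($0 ~ inc) ? 1 : 0      # some field equals an inclusion code
    e = ($0 ~ exc) ? 1 : 0      # some field equals an exclusion code
    keep = (NR==1 || (i && !e)) ? 1 : 0
    print NR, i, e, keep
}' "$tmp/file")
echo "$trace"
rm -r "$tmp"
```

With these assumed codes, the fourth line of the file (12301,456,00,145,...) matches both regexes and is dropped, which is the "exclusion takes precedence" behavior requested in the question's comments.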

Alternate awk solution

$ gawk -F, -v inc="$(<inclusion)" -v exc="$(<exclusion)" 'BEGIN{n=split(inc,x,"\n"); for (j=1;j<=n;j++)incl[x[j]]=1; n=split(exc,x,"\n"); for (j=1;j<=n;j++)excl[x[j]]=1;} NR==1{print;next} {p=0;for (j=1;j<=NF;j++) if ($j in incl)p=1; for (j=1;j<=NF;j++) if ($j in excl) p=0;} p' file
D1,D2,D3,D4,A,B,C,D,E,F
123,00,145,567,A1,B1,C1,D1,E1,F1
567,1250,010,321,A4,B4,C4,D4,E4,F4

The same code written out over multiple lines looks like:

gawk -F, -v inc="$(<inclusion)" -v exc="$(<exclusion)" '
BEGIN{
    n=split(inc,x,"\n")
    for (j=1;j<=n;j++)incl[x[j]]=1
    n=split(exc,x,"\n")
    for (j=1;j<=n;j++)excl[x[j]]=1
}
NR==1{
    print
    next
} 

{
    p=0
    for (j=1;j<=NF;j++) if ($j in incl) p=1
    for (j=1;j<=NF;j++) if ($j in excl) p=0
}
p
' file

The above creates the arrays incl and excl from the inclusion and exclusion data. Any line with a field in incl is marked for printing, p=1. If, however, the line also contains a field in excl, then p is reset to false, p=0, so exclusion takes precedence.
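
The same array idea can be sketched without the bash-specific $(<file) syntax by passing the two code files to awk as ordinary file operands. This is a variant under stated assumptions, not the answer's own code: the ncols variable (a hypothetical name) limits the test to the first four columns, as the question describes, and the inclusion codes 145 and 567 are assumed for illustration.

```shell
tmp=$(mktemp -d)
printf '%s\n' 145 567 > "$tmp/inclusion"          # assumed inclusion codes
printf '%s\n' 456 457 458 459 > "$tmp/exclusion"  # exclusion codes from the answer
cat > "$tmp/file" <<'EOF'
D1,D2,D3,D4,A,B,C,D,E,F
123,00,145,567,A1,B1,C1,D1,E1,F1
890,001,456,0009,A2,B2,C2,D2,E2,F2
12301,456,00,145,A3,B3,C3,D3,E3,F3
567,1250,010,321,A4,B4,C4,D4,E4,F4
EOF

result=$(awk -F, -v ncols=4 '
FILENAME == ARGV[1] { if ($0 != "") incl[$0] = 1; next }  # first operand: inclusion codes
FILENAME == ARGV[2] { if ($0 != "") excl[$0] = 1; next }  # second operand: exclusion codes
FNR == 1 { print; next }                                  # header line of the CSV
{
    p = 0
    for (j = 1; j <= ncols && j <= NF; j++) {
        if ($j in excl) { p = 0; break }   # exclusion takes precedence
        if ($j in incl) p = 1
    }
}
p
' "$tmp/inclusion" "$tmp/exclusion" "$tmp/file")
echo "$result"
rm -r "$tmp"
```

Because each field is looked up as an exact array key, no regex is built and the code files are read only once, which should suit a CSV with millions of lines.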

edited Jul 7, 2015 at 17:54

answered Jul 7, 2015 at 3:08

John1024

4 Comments

oort Over a year ago

Thank you John1024. Does that code look at all the columns for the inclusion/exclusion criteria, or a specific few?

oort Over a year ago

When I run this code on awk (OSX version 20070501), it fails. When I run this on gawk (v4.1.3), it omits the second line in the test files given above. When I run this on mawk 1.3.4, it also omits the second line. Is there a known issue in the different versions of awk interpreting regexes?

John1024 Over a year ago

@oort It is designed to look at all columns from first through last. And, yes, BSD (OSX) and GNU awk have many subtle and annoying incompatibilities. Your gawk version, however, is nearly identical to mine (4.1.1) and I do not see why you would get different results from me on that. To verify, I just copied and pasted the command from the answer here to my terminal and got the correct result including the second line.

John1024 Over a year ago

@oort I just added a new and very awk solution. It does not use regex at all. Let me know if it works for you.
