Nature of the Problem:

I have a CSV file with 10 columns, of which 4 columns specify codes for diseases. Let us say that these are columns 1 - 4. I have 2 text files that contain "inclusion" and "exclusion" codes.

The inclusion file contains n input strings, one per line.

Example:

123
12300
12301
124
12400
12401
1250

The exclusion file likewise contains m input strings, one per line.

Example:

456
457
458
459

A truncated version of the CSV file would look like the following:

D1,D2,D3,D4,A,B,C,D,E,F
123,00,145,567,A1,B1,C1,D1,E1,F1
890,001,456,0009,A2,B2,C2,D2,E2,F2
12301,456,00,145,A3,B3,C3,D3,E3,F3
567,1250,010,321,A4,B4,C4,D4,E4,F4

Using AWK, how can I take the two files, inclusion and exclusion, together with the CSV file and return the following:

D1,D2,D3,D4,A,B,C,D,E,F
123,00,145,567,A1,B1,C1,D1,E1,F1
567,1250,010,321,A4,B4,C4,D4,E4,F4

The CSV file can have millions of lines, while the inclusion and exclusion files can have dozens of lines. This is not a homework assignment, and I appreciate the help.

  • What happens if a line has both a field that matches inclusion and one that matches exclusion? What happens if it has neither? What have you tried so far? Commented Jul 7, 2015 at 2:16
  • Exclusion takes precedence. That's why the 3rd line is left out. Sorry for not making that clear. Commented Jul 7, 2015 at 2:21
  • Very good. And my other two questions? Commented Jul 7, 2015 at 2:22
  • If it doesn't match, then the line is excluded. Up until this point I have been doing this by hardcoding specific strings into an awk line. Commented Jul 7, 2015 at 2:34

1 Answer

Using grep

$ head -n1 <file; grep -E "(^|,)($(tr '\n' '|' <inclusion))(,|$)" file | grep -Ev "(^|,)($(tr '\n' '|' <exclusion))(,|$)"
D1,D2,D3,D4,A,B,C,D,E,F
123,00,145,567,A1,B1,C1,D1,E1,F1
567,1250,010,321,A4,B4,C4,D4,E4,F4

Using awk

$ awk -v inc="(^|,)($(tr '\n' '|' <inclusion))(,|$)" -v exc="(^|,)($(tr '\n' '|' <exclusion))(,|$)" 'NR==1 || ($0 ~ inc && ! ($0 ~ exc))' file
D1,D2,D3,D4,A,B,C,D,E,F
123,00,145,567,A1,B1,C1,D1,E1,F1
567,1250,010,321,A4,B4,C4,D4,E4,F4

How it works

For both the grep and awk solutions, the key step is building a regular expression that matches any of the codes listed in the inclusion file or in the exclusion file. Because it is shorter, let's take exclusion as an example. We can create a regex for it as follows:

$ echo "(^|,)($(tr '\n' '|' <exclusion))(,|$)"
(^|,)(456|457|458|459|)(,|$)

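The leading (^|,) and trailing (,|$) require a code to be a complete comma-separated field, so 456 would not match inside a longer field such as 04567. The regex for inclusion works analogously. Once the inclusion and exclusion regexes have been created, we can use them either with grep or with awk. If using awk, we use the condition: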

NR==1 || ($0 ~ inc && ! ($0 ~ exc))

If this condition is true, awk performs its default action, which is to print the line. The condition is true if (1) we are on the first line (NR==1), or (2) the line matches the inclusion regex, inc, and does not match the exclusion regex, exc.
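
One detail of the regex printed above may matter for portability: because the list file ends with a newline, tr leaves a trailing |, so the group ends in an empty alternative. Awk implementations are not guaranteed to treat an empty alternative the same way, which might be one source of differing results. The following is a minimal sketch, not part of the original answer, that joins the lines with paste instead so that no trailing | is produced. It assumes the list files contain no blank lines; the scratch directory and the variable names (tmp, inc_re, exc_re, result) are made up for the illustration.

```shell
# Illustration only: recreate the sample files in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 123 12300 12301 124 12400 12401 1250 > "$tmp/inclusion"
printf '%s\n' 456 457 458 459 > "$tmp/exclusion"
cat > "$tmp/file" <<'EOF'
D1,D2,D3,D4,A,B,C,D,E,F
123,00,145,567,A1,B1,C1,D1,E1,F1
890,001,456,0009,A2,B2,C2,D2,E2,F2
12301,456,00,145,A3,B3,C3,D3,E3,F3
567,1250,010,321,A4,B4,C4,D4,E4,F4
EOF

# paste -s joins all lines of a file into one; -d'|' puts | between them,
# so nothing is emitted after the last code.
inc_re="(^|,)($(paste -sd'|' "$tmp/inclusion"))(,|\$)"
exc_re="(^|,)($(paste -sd'|' "$tmp/exclusion"))(,|\$)"
echo "$exc_re"

# Same condition as in the answer: keep the header, plus lines that match
# the inclusion regex and do not match the exclusion regex.
result=$(awk -v inc="$inc_re" -v exc="$exc_re" \
    'NR==1 || ($0 ~ inc && $0 !~ exc)' "$tmp/file")
echo "$result"
```

With the sample data this should print the exclusion regex without the trailing |, followed by the header and the A1 and A4 rows.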

Alternate awk solution

$ gawk -F, -v inc="$(<inclusion)" -v exc="$(<exclusion)" 'BEGIN{n=split(inc,x,"\n"); for (j=1;j<=n;j++)incl[x[j]]=1; n=split(exc,x,"\n"); for (j=1;j<=n;j++)excl[x[j]]=1;} NR==1{print;next} {p=0;for (j=1;j<=NF;j++) if ($j in incl)p=1; for (j=1;j<=NF;j++) if ($j in excl) p=0;} p' file
D1,D2,D3,D4,A,B,C,D,E,F
123,00,145,567,A1,B1,C1,D1,E1,F1
567,1250,010,321,A4,B4,C4,D4,E4,F4

The same code written out over multiple lines looks like:

gawk -F, -v inc="$(<inclusion)" -v exc="$(<exclusion)" '
BEGIN{
    n=split(inc,x,"\n")
    for (j=1;j<=n;j++)incl[x[j]]=1
    n=split(exc,x,"\n")
    for (j=1;j<=n;j++)excl[x[j]]=1
}
NR==1{
    print
    next
} 

{
    p=0
    for (j=1;j<=NF;j++) if ($j in incl) p=1
    for (j=1;j<=NF;j++) if ($j in excl) p=0
}
p
' file

The above creates the arrays incl and excl from the inclusion and exclusion data. Any line with a field in incl is marked for printing (p=1). If, however, the line also contains a field in excl, p is reset to 0 and the line is not printed.
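
The array approach above checks every field and relies on bash's $(<file) to pass the lists in through -v. As a hedged variation, not the original author's code, the sketch below reads the two list files as ordinary awk input instead and checks only the first four fields, on the assumption (taken from the question) that the codes live in columns 1 - 4. It uses only POSIX awk features; the scratch directory and the names tmp, filtered and keep are invented for the example.

```shell
# Illustration only: recreate the sample files in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 123 12300 12301 124 12400 12401 1250 > "$tmp/inclusion"
printf '%s\n' 456 457 458 459 > "$tmp/exclusion"
cat > "$tmp/file" <<'EOF'
D1,D2,D3,D4,A,B,C,D,E,F
123,00,145,567,A1,B1,C1,D1,E1,F1
890,001,456,0009,A2,B2,C2,D2,E2,F2
12301,456,00,145,A3,B3,C3,D3,E3,F3
567,1250,010,321,A4,B4,C4,D4,E4,F4
EOF

filtered=$(awk -F, '
# The first two file arguments are the code lists; load each into an array.
# Blank lines are skipped so they cannot match empty CSV fields.
FILENAME == ARGV[1] { if ($0 != "") incl[$0] = 1; next }
FILENAME == ARGV[2] { if ($0 != "") excl[$0] = 1; next }
# From here on we are reading the CSV. Always print its header line.
FNR == 1 { print; next }
{
    keep = 0
    # Assumption: only fields 1 to 4 hold disease codes.
    for (j = 1; j <= 4; j++) {
        if ($j in excl) { keep = 0; break }   # exclusion takes precedence
        if ($j in incl) keep = 1
    }
}
keep
' "$tmp/inclusion" "$tmp/exclusion" "$tmp/file")
echo "$filtered"
```

Because each line is split once and each field is looked up in an array, the cost per CSV line does not grow with the number of codes, which may help when the CSV has millions of lines.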

4 Comments

Thank you John1024. Does that code look at all the columns for the inclusion/exclusion criteria, or a specific few?
When I run this code on awk (OSX version 20070501), it fails. When I run this on gawk (v4.1.3), it omits the second line in the test files given above. When I run this on mawk 1.3.4, it also omits the second line. Is there a known issue in the different versions of awk interpreting regexes?
@oort It is designed to look at all columns from first through last. And, yes, BSD (OSX) and GNU awk have many subtle and annoying incompatibilities. Your gawk version, however, is nearly identical to mine (4.1.1) and I do not see why you would get different results from me on that. To verify, I just copied and pasted the command from the answer here to my terminal and got the correct result including the second line.
@oort I just added a new and very awk solution. It does not use regex at all. Let me know if it works for you.
