
I would be grateful for your help with the following.

I have the following file (file.txt), which is about 10,000 lines long:

ID1  ID2  0  1  0.5  0.6
ID3  ID4  0  0  0.4  0.8
ID1  ID5  0  1  0.5  0.3
ID6  ID2  1  0  0.4  0.8

Each ID in the first two columns can occur between 1 and 10 times in the file (in either column 1 or column 2).

What I want to achieve:

I want to scan this file line by line, and print IDs to an ever-growing exclusion list if they meet the following criteria:

My criteria are as follows:

If $3 > $4, print $2 (ID2) to exclusionlist.txt
If $3 < $4, print $1 (ID1) to exclusionlist.txt
If $3 = $4 and $5 < $6, print $2 (ID2) to exclusionlist.txt
If $3 = $4 and $5 > $6, print $1 (ID1) to exclusionlist.txt

So applying this to row 1, ID1 should be in my exclusion list, given that $3 < $4.

I then want to delete all lines in the file where that ID from the exclusion list appears. (This can be up to 10 rows).

The output for file.txt once row 1 has been scanned should look like:

ID3 ID4 0 0 0.4 0.8
ID6 ID2 1 0 0.4 0.8

And exclusionlist.txt: ID1

I then want to start again at the new row 1 (because the original row 1 will have been deleted by definition) and execute the same process, adding the exclusion from the new row 1 to the same exclusion list.

This is what I have tried. It has meant having to rename file.txt to 1.txt.

#!/bin/bash
for i in {1..5000}
do
awk 'NR==1{print;}' $i.txt
awk '{if ($3>$4 || $3==$4 && $5<$6) print $2;}' $i.txt >      exclusionlist_$i.txt
awk '{if ($3>$4 || $3==$4 && $5>$6) print $1;}' $i.txt >>    exclusionlist_$i.txt
grep -v -f exclusionlist_$i.txt $i.txt > $((i+1)).txt
rm $i.txt
done

Due to my poor scripting skills, I (1) have to rename my file after each loop so that the next iteration can run, and (2) end up with a new exclusion list per loop rather than a single 'master' exclusion list. I can easily concatenate them all at the end, so this is not a major problem, but it is messy.

The problem I have is that this command seems to scan through the whole file (rather than just line 1), creating a long exclusion list just from the first run.

Any help/suggestions would be greatly appreciated.

Thank you.

GB

  • according to your criteria, the only lines that should stay are where $3 == $4 && $5 == $6 Commented Aug 11, 2017 at 18:25
  • @GB44444 read what to do after getting solution meta.stackexchange.com/questions/5234/… Commented Sep 19, 2017 at 13:05

1 Answer


I didn't understand why you need to do this in multiple steps. Eventually, all the lines will be deleted and you'll only get the exclusion list.

For example, this will do the same in one pass:

$ awk '!($1 in exc) && !($2 in exc){f=($3>$4 || $3==$4 && $5<$6)?2:1; 
                                    print $f > "exclusion.list"; exc[$f]}' file

$ cat exclusion.list
ID1
ID4
ID2

Since the only output is the exclusion list, you can print it to stdout

$ awk '!($1 in exc) && !($2 in exc){f=($3>$4 || $3==$4 && $5<$6)?2:1; 
                                    print $f; exc[$f]}' file  > exclusion.list          

and redirect to a file.
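
For comparison, here is a rough sketch of the literal row-by-row procedure from the question, using one working copy and one exclusion list instead of numbered files. The names `work.txt` and `work.tmp` are my own placeholders, and a tie (`$3==$4 && $5==$6`) falls through to `$1`, which is only an assumption since the spec does not cover it. On the sample data it should yield the same list as the one-pass version.

```shell
# Sample data from the question.
cat > file.txt <<'EOF'
ID1  ID2  0  1  0.5  0.6
ID3  ID4  0  0  0.4  0.8
ID1  ID5  0  1  0.5  0.3
ID6  ID2  1  0  0.4  0.8
EOF

cp file.txt work.txt        # work.txt is a placeholder name; file.txt stays intact
: > exclusionlist.txt       # start with an empty master list

# Repeat until the working copy is empty.
while [ -s work.txt ]; do
  # Apply the criteria to the current first row only, then stop reading.
  id=$(awk 'NR == 1 { print (($3 > $4 || ($3 == $4 && $5 < $6)) ? $2 : $1); exit }' work.txt)
  printf '%s\n' "$id" >> exclusionlist.txt

  # Keep only rows where neither of the first two fields equals that ID.
  awk -v id="$id" '$1 != id && $2 != id' work.txt > work.tmp
  mv work.tmp work.txt
done

cat exclusionlist.txt
```

The field comparison in the second awk call is deliberate: `grep -v -f` treats each list entry as a pattern that may match anywhere in the line, so an entry such as ID1 would also remove rows containing ID10.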

Or perhaps I misunderstood the problem. Note also that the $3==$4 && $5==$6 case is not defined in your spec. Perhaps that's what you're after?! If so, create sample data that includes this critical case and indicate what needs to happen.
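
If ties did occur, one possible reading (purely an assumption on my part, not something the spec states) is that a tie row excludes nobody and survives unless one of its IDs is excluded by some other row. A sketch under that assumption, with a hypothetical tie row `ID7 ID8` added to the sample and `kept.txt` as a made-up output name:

```shell
# Sample data from the question plus one hypothetical tie row (ID7 ID8).
cat > file.txt <<'EOF'
ID1  ID2  0  1  0.5  0.6
ID3  ID4  0  0  0.4  0.8
ID1  ID5  0  1  0.5  0.3
ID6  ID2  1  0  0.4  0.8
ID7  ID8  1  1  0.5  0.5
EOF

awk '
  # Row already removed by an earlier exclusion: skip it.
  ($1 in exc) || ($2 in exc) { next }

  # Assumption: a tie excludes nobody; remember the row for later.
  $3 == $4 && $5 == $6 { tie[++n] = $0; next }

  # Same rule as the one-pass version above.
  {
    f = ($3 > $4 || ($3 == $4 && $5 < $6)) ? 2 : 1
    print $f
    exc[$f]
  }

  # Write out tie rows whose IDs were never excluded by any other row.
  END {
    for (i = 1; i <= n; i++) {
      split(tie[i], a)
      if (!(a[1] in exc) && !(a[2] in exc)) print tie[i] > "kept.txt"
    }
  }
' file.txt > exclusion.list
```

Here `exclusion.list` holds the excluded IDs and `kept.txt` holds the surviving tie rows; `kept.txt` is not created at all when no tie row survives.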


1 Comment

That seems to work very well. Thank you very much indeed! (N.B. $3==$4 && $5==$6 doesn't occur in the file).
