I have a data table called iso:
> iso
variant_id transcript_id is_NL counts nrows
1: chr10_129450960_T_C_b38 chr10_129467297_129536240 0 33029 458
2: chr10_129450960_T_C_b38 chr10_129467297_129536240 1 3477 54
3: chr10_129450960_T_C_b38 chr10_129467297_129536240 2 130 3
4: chr10_129450960_T_C_b38 chr10_129536378_129563778 0 51 458
5: chr10_129450960_T_C_b38 chr10_129536378_129563778 1 8 54
---
500148: chr9_34699703_G_C_b38 chr9_34649082_34649409 1 4214 57
500149: chr9_34699703_G_C_b38 chr9_34649082_34649409 2 171 2
500150: chr9_34699703_G_C_b38 chr9_34649565_34650368 0 48713 456
500151: chr9_34699703_G_C_b38 chr9_34649565_34650368 1 4932 57
500152: chr9_34699703_G_C_b38 chr9_34649565_34650368 2 208 2
I filtered it so that a row is kept when is_NL == 0 and counts/nrows < 50, or when is_NL is 1 or 2 and counts/nrows > 50:
> iso[with(iso, (is_NL == 0 & counts/nrows < 50) |
+ (is_NL %in% c(1,2) & counts/nrows > 50)),]
variant_id transcript_id is_NL counts nrows
1: chr10_129450960_T_C_b38 chr10_129467297_129536240 1 3477 54
2: chr10_129450960_T_C_b38 chr10_129536378_129563778 0 51 458
3: chr10_129450960_T_C_b38 chr10_129536378_129707894 1 3847 54
4: chr10_129450960_T_C_b38 chr10_129701913_129707894 0 188 458
5: chr10_129450960_T_C_b38 chr10_129708044_129715519 0 17 458
---
198076: chr9_34699703_G_C_b38 chr9_34648908_34648997 0 611 456
198077: chr9_34699703_G_C_b38 chr9_34649082_34649409 1 4214 57
198078: chr9_34699703_G_C_b38 chr9_34649082_34649409 2 171 2
198079: chr9_34699703_G_C_b38 chr9_34649565_34650368 1 4932 57
198080: chr9_34699703_G_C_b38 chr9_34649565_34650368 2 208 2
However, I now realize that I only want to keep a row if every other row with the same variant_id and transcript_id also meets its criterion. For example:
500150: chr9_34699703_G_C_b38 chr9_34649565_34650368 0 48713 456
500151: chr9_34699703_G_C_b38 chr9_34649565_34650368 1 4932 57
500152: chr9_34699703_G_C_b38 chr9_34649565_34650368 2 208 2
The above shows the shape I want: one row per is_NL value for the same variant_id and transcript_id pair, where every row meets its criterion, i.e. counts/nrows < 50 (when is_NL == 0) or counts/nrows > 50 (when is_NL is 1 or 2).
198077: chr9_34699703_G_C_b38 chr9_34649082_34649409 1 4214 57
198078: chr9_34699703_G_C_b38 chr9_34649082_34649409 2 171 2
The above is an example of what I do not want. Both rows have matching variant_id and transcript_id values and a passing value of counts/nrows, but the row with is_NL == 0 is missing, presumably because counts/nrows is not < 50 for that row.
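To make the problem concrete, here is a rough sketch of how I can spot the incomplete pairs in the filtered result. The four rows are copied from the output above; the names `filtered`, `incomplete`, `n_kept` and `missing_is_NL` are just ones I made up for this illustration.

```r
library(data.table)

# Four rows copied from the tail of the filtered output shown above
filtered <- data.table(
  variant_id    = "chr9_34699703_G_C_b38",
  transcript_id = c("chr9_34649082_34649409", "chr9_34649082_34649409",
                    "chr9_34649565_34650368", "chr9_34649565_34650368"),
  is_NL  = c(1, 2, 1, 2),
  counts = c(4214, 171, 4932, 208),
  nrows  = c(57, 2, 57, 2)
)

# Count the surviving rows per (variant_id, transcript_id) pair and record
# which is_NL values are absent; fewer than 3 rows means the pair is incomplete
incomplete <- filtered[
  , .(n_kept = .N,
      missing_is_NL = paste(setdiff(0:2, is_NL), collapse = ",")),
  by = .(variant_id, transcript_id)
][n_kept < 3]

print(incomplete)
```

Both pairs in this small sample show up as incomplete, each missing its is_NL == 0 row.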
I hope I have made myself clear. I only want groups of rows where variant_id and transcript_id match and where every row passes: counts/nrows < 50 if is_NL == 0, and counts/nrows > 50 if is_NL is 1 or 2.
If this is done correctly, I should end up with triplets of rows for each variant_id and transcript_id combination, and each triplet should contain exactly one row for each is_NL value: 0, 1 and 2.
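A minimal sketch of the kind of group-wise filter I think I need, assuming the check can be done per group with `by`. I have not run this on the full table. The first three rows are copied from my data; the `demo_variant` / `demo_transcript` rows are hypothetical values I invented so that one triplet passes, and `res`, `ratio` and `ok` are illustrative names.

```r
library(data.table)

iso <- data.table(
  # First triplet: real rows from the table above (the is_NL == 0 row fails).
  # Second triplet: HYPOTHETICAL rows, invented so that all three rows pass.
  variant_id    = c(rep("chr9_34699703_G_C_b38", 3), rep("demo_variant", 3)),
  transcript_id = c(rep("chr9_34649565_34650368", 3), rep("demo_transcript", 3)),
  is_NL         = c(0, 1, 2, 0, 1, 2),
  counts        = c(48713, 4932, 208, 1000, 4000, 300),
  nrows         = c(456, 57, 2, 456, 57, 2)
)

# For each (variant_id, transcript_id) pair, return its rows only if
# every row passes its own criterion and all three is_NL values are present.
# Groups for which the `if` is FALSE return NULL and are dropped.
res <- iso[, {
  ratio <- counts / nrows
  ok <- (is_NL == 0 & ratio < 50) | (is_NL %in% c(1, 2) & ratio > 50)
  if (all(ok) && all(0:2 %in% is_NL)) .SD
}, by = .(variant_id, transcript_id)]

print(res)
```

On this toy input only the hypothetical triplet survives, because the real is_NL == 0 row has counts/nrows of about 107, which takes its whole pair out.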