How do I filter multiple rows with matching column values based on all rows meeting a certain condition? [R]

Question

I have a data table called iso:

> iso
                     variant_id             transcript_id is_NL counts nrows
     1: chr10_129450960_T_C_b38 chr10_129467297_129536240     0  33029   458
     2: chr10_129450960_T_C_b38 chr10_129467297_129536240     1   3477    54
     3: chr10_129450960_T_C_b38 chr10_129467297_129536240     2    130     3
     4: chr10_129450960_T_C_b38 chr10_129536378_129563778     0     51   458
     5: chr10_129450960_T_C_b38 chr10_129536378_129563778     1      8    54
    ---
500148:   chr9_34699703_G_C_b38    chr9_34649082_34649409     1   4214    57
500149:   chr9_34699703_G_C_b38    chr9_34649082_34649409     2    171     2
500150:   chr9_34699703_G_C_b38    chr9_34649565_34650368     0  48713   456
500151:   chr9_34699703_G_C_b38    chr9_34649565_34650368     1   4932    57
500152:   chr9_34699703_G_C_b38    chr9_34649565_34650368     2    208     2

I filtered it so that a row with is_NL == 0 is kept only if counts/nrows < 50, and a row with is_NL equal to 1 or 2 is kept only if counts/nrows > 50:

> iso[with(iso, (is_NL == 0 & counts/nrows < 50) |
+                 (is_NL %in% c(1,2) & counts/nrows > 50)),]
                     variant_id             transcript_id is_NL counts nrows
     1: chr10_129450960_T_C_b38 chr10_129467297_129536240     1   3477    54
     2: chr10_129450960_T_C_b38 chr10_129536378_129563778     0     51   458
     3: chr10_129450960_T_C_b38 chr10_129536378_129707894     1   3847    54
     4: chr10_129450960_T_C_b38 chr10_129701913_129707894     0    188   458
     5: chr10_129450960_T_C_b38 chr10_129708044_129715519     0     17   458
    ---
198076:   chr9_34699703_G_C_b38    chr9_34648908_34648997     0    611   456
198077:   chr9_34699703_G_C_b38    chr9_34649082_34649409     1   4214    57
198078:   chr9_34699703_G_C_b38    chr9_34649082_34649409     2    171     2
198079:   chr9_34699703_G_C_b38    chr9_34649565_34650368     1   4932    57
198080:   chr9_34699703_G_C_b38    chr9_34649565_34650368     2    208     2

However, I now realize that I only want to keep a row if every other row with the same variant_id and transcript_id also meets its criterion. For example:

500150:   chr9_34699703_G_C_b38    chr9_34649565_34650368     0  48713   456
500151:   chr9_34699703_G_C_b38    chr9_34649565_34650368     1   4932    57
500152:   chr9_34699703_G_C_b38    chr9_34649565_34650368     2    208     2

The above shows the shape of result I mean: for every value of is_NL, the variant_id and transcript_id pair should meet its criterion, either counts/nrows < 50 (when is_NL == 0) or counts/nrows > 50 (when is_NL is 1 or 2).

198077:   chr9_34699703_G_C_b38    chr9_34649082_34649409     1   4214    57
198078:   chr9_34699703_G_C_b38    chr9_34649082_34649409     2    171     2

The above is an example of what I do not want. Both rows have matching variant_id and transcript_id values and the correct value for counts/nrows, but the row with is_NL == 0 is missing, presumably because counts/nrows is not below 50 for that row.

I hope I have made myself clear. I just want the groups where variant_id and transcript_id match and where counts/nrows is < 50 for the is_NL == 0 row and > 50 for the rows with is_NL equal to 1 or 2.

If this is done correctly, I should have triplets of rows for each variant_id and transcript_id combination, and each triplet should contain one row for each is_NL value (0, 1 and 2).

Brunox13 · Accepted Answer · 2019-12-14 07:47:32Z


Try the following:

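library(data.table)  # setDT() and := come from data.table, not dplyr

# Apply the row-level filter, count the surviving rows per pair,
# keep only complete triplets, then drop the helper column.
iso <- setDT(iso)[(is_NL == 0 & counts/nrows < 50) | (is_NL %in% c(1, 2) & counts/nrows > 50)
                ][, triplet := .N, by = .(variant_id, transcript_id)
                ][triplet == 3
                ][, triplet := NULL]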

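It creates a temporary column, triplet, holding the number of rows that survive the filter for each variant_id and transcript_id pair, keeps only the pairs where all three rows survive, and then drops the column.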
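For checking the logic on a small case, here is a minimal sketch of the same idea written as a per-group test with all(). The toy values and the names toy, ok and res are made up for illustration and are not from the question's data. It also assumes each pair has at most one row per is_NL value; with duplicate rows, the .N == 3 check would not behave like the approach above.

```r
library(data.table)

# Toy data (hypothetical): one variant, two transcripts, three is_NL rows each.
toy <- data.table(
  variant_id    = rep("v1", 6),
  transcript_id = rep(c("tA", "tB"), each = 3),
  is_NL         = rep(0:2, times = 2),
  counts        = c(4000, 3000, 200, 40000, 3000, 200),
  nrows         = c(100, 50, 2, 100, 50, 2)
)

# Row-level condition from the question, stored as a helper column.
toy[, ok := (is_NL == 0 & counts / nrows < 50) |
            (is_NL %in% c(1, 2) & counts / nrows > 50)]

# Keep a pair only if it has three rows and every one of them passes.
res <- toy[, if (.N == 3 && all(ok)) .SD, by = .(variant_id, transcript_id)]
res[, ok := NULL]

print(res)
```

Here the tB pair is dropped as a whole because its is_NL == 0 row has counts/nrows = 400, while all three tA rows are kept.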

edited Dec 14, 2019 at 7:47

Brunox13


answered Dec 13, 2019 at 20:08

Grzegorz Sionkowski



3 Comments

CelineDion Over a year ago

Hmm.. it's not giving me the desired result. Should I try that line after iso[with(iso, (is_NL == 0 & counts/nrows < 50) | (is_NL %in% c(1,2) & counts/nrows > 50)),]?

Grzegorz Sionkowski Over a year ago

Yes, I did not want to change anything about your approach, just add something to it.

Brunox13 Over a year ago

The answer has now been edited and should work as is.
