I'm looking for the best syntax to keep only the rows of a data.table where at least one of a group of variables is non-null (not NA).

An example is below:

> dat <- data.table(a=1:5, b=c(1:3, NA, NA), c=c(NA, 1:3, NA))

> cols <- c('b', 'c')

> dat[!all(is.na(cols)), .SD, with=FALSE]
Null data.table (0 rows and 0 cols)

> dat[!is.na(b)|!is.na(c), .SD]
   a  b  c
1: 1  1 NA
2: 2  2  1
3: 3  3  2
4: 4 NA  3

As you can see, it works if I name each variable explicitly, as in !is.na(variable1) | !is.na(variable2). However, I can't find a way to pass a group of variables so that I can write a single condition instead of chaining everything together with "or".

2 Answers

You can use the following syntax with rowSums and .SD:

dat[dat[, rowSums(!is.na(.SD)) > 0, .SDcols  = cols]]
#   a  b  c
#1: 1  1 NA
#2: 2  2  1
#3: 3  3  2
#4: 4 NA  3

The inner call returns a logical vector with one element per row, which looks like this:

dat[, rowSums(!is.na(.SD)) > 0, .SDcols  = cols]
# [1]  TRUE  TRUE  TRUE  TRUE FALSE
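
To make it clearer what rowSums(!is.na(.SD)) is doing, here is a minimal sketch of the intermediate steps, run outside of the data.table call on the same example data. The names m and n_present are only illustrative and are not part of the answer above.

```r
library(data.table)

dat  <- data.table(a = 1:5, b = c(1:3, NA, NA), c = c(NA, 1:3, NA))
cols <- c("b", "c")

# is.na() on a data.table returns a logical matrix (rows x selected columns)
m <- !is.na(dat[, cols, with = FALSE])

# TRUE counts as 1, so each row sum is the number of non-NA values in that row
n_present <- as.vector(rowSums(m))
n_present
# [1] 1 2 2 1 0

# rows with at least one non-NA value among `cols`
n_present > 0
# [1]  TRUE  TRUE  TRUE  TRUE FALSE
```

Row 5 is the only one where both b and c are NA, so it is the only row dropped.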

Re the comment by Michael, you can also use Reduce + lapply:

dat[dat[, Reduce("+", lapply(.SD, function(x) !is.na(x))) > 0, .SDcols = cols]]

But for most of my use cases, the rowSums approach is ok and easier to read, imo.
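
As a variation on the Reduce idea, the per-column checks can also be combined with a logical OR instead of being summed. This is only a sketch on the example data from the question (the name keep is illustrative), and I have not benchmarked it against rowSums:

```r
library(data.table)

dat  <- data.table(a = 1:5, b = c(1:3, NA, NA), c = c(NA, 1:3, NA))
cols <- c("b", "c")

# Build one "not NA" logical vector per column, then OR them together
# pairwise; no intermediate matrix is created.
keep <- dat[, Reduce(`|`, lapply(.SD, function(x) !is.na(x))), .SDcols = cols]
keep
# [1]  TRUE  TRUE  TRUE  TRUE FALSE

# Use the logical vector to subset the rows
dat[keep]
```

Because `|` is applied two vectors at a time by Reduce, this should work for any number of columns in cols.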


1 Comment

Works, but the matrix conversion is expensive on large data. There's a Reduce approach, but I'm AFK and can't test it. Something like do.call(`+`, lapply(.SD, is.na)) could also work.

You could also do this.

dat[rowSums(!is.na(dat[, cols, with=FALSE])) > 0,]
#   a  b  c
#1: 1  1 NA
#2: 2  2  1
#3: 3  3  2
#4: 4 NA  3

1 Comment

Well, this doesn't fulfill OP's requirements of using a vector of relevant column names ("cols").
