0

My objective here was to iterate across each column in a df and then for each column iterate down each row and perform a function. The specific function in this case replaces the NA values with the corresponding value in the final column, but the details of the function required are not relevant to the question here. I got the results I needed using two nested for loops like this:

for (j in 1:ncol(df.i)) {
  for (i in 1:nrow(df.i)) {
    df.i[i,j] <- ifelse(is.na(df.i[i,j]), df.i[i,39], df.i[i,j])
  }
}

However, I believe this should be possible using an apply(df.i, 1, function) nested within an apply(df.i, 2, function) But I'm not totally sure that is possible or how to do it. Does anyone know how to achieve the same thing with a nested use of the apply function?

2
  • ifelse is a vectorized function, so your inner loop can be replaced with: df.i[,j] <- ifelse(is.na(df.i[,j]), df.i[,39], df.i[,j]). This can now be used in your apply function. Commented Aug 1, 2018 at 15:35
  • 2
    Beware when using apply() with data.frames. apply() coerces the data.frame to matrix where all columns are of the same data type. This seems not to be an issue in your particular case but in general it is safer to use lapply(). Commented Aug 1, 2018 at 15:40

1 Answer 1

2

Here are four ways to do what the inner instruction does.

First, a dataset example.

set.seed(5345)    # Make the results reproducible
df.i <- matrix(1:400, ncol = 40)
is.na(df.i) <- sample(400, 50)

Now, the comment by @Dave2e: just one for loop, vectorize the inner most one.

df.i2 <- df.i3 <- df.i1 <- df.i    # Work with copies

for (j in 1:ncol(df.i1)) {
  df.i1[,j] <- ifelse(is.na(df.i1[, j]), df.i1[, 39], df.i1[, j])
}

Then, vectorized, no loops at all.

df.i2 <- ifelse(is.na(df.i), df.i[, 39], df.i)

Another vectorized, by @Gregor in the comment, much better since ifelse is known to be relatively slow.

df.i3[is.na(df.i3)] <- df.i3[row(df.i3)[is.na(df.i3)], 39]

And your solution, as posted in the question.

for (j in 1:ncol(df.i)) {
  for (i in 1:nrow(df.i)) {
    df.i[i,j] <- ifelse(is.na(df.i[i,j]), df.i[i,39], df.i[i,j])
  }
}

Compare the results.

identical(df.i, df.i1)
#[1] TRUE

identical(df.i, df.i2)
#[1] TRUE

identical(df.i, df.i3)
#[1] TRUE

Benchmarks.

After the comment by @Gregor I have decided to benchmark the 4 solutions. As expected each optimization gives a significant seep up and his fully vectorized solution is the fastest.

f <- function(df.i){
  for (j in 1:ncol(df.i)) {
    for (i in 1:nrow(df.i)) {
      df.i[i,j] <- ifelse(is.na(df.i[i,j]), df.i[i,39], df.i[i,j])
    }
  }
  df.i
}

f1 <- function(df.i1){
  for (j in 1:ncol(df.i1)) {
    df.i1[,j] <- ifelse(is.na(df.i1[, j]), df.i1[, 39], df.i1[, j])
  }
  df.i1
}

f2 <- function(df.i2){
  df.i2 <- ifelse(is.na(df.i2), df.i2[, 39], df.i2)
  df.i2
}

f3 <- function(df.i3){
  df.i3[is.na(df.i3)] <- df.i3[row(df.i3)[is.na(df.i3)], 39]
  df.i3
}

microbenchmark::microbenchmark(
  two_loops = f(df.i),
  one_loop = f1(df.i1),
  ifelse = f2(df.i2),
  vectorized = f3(df.i3)
)
#Unit: microseconds
#      expr      min        lq       mean    median       uq      max neval
# two_loops 1125.017 1143.4995 1226.93089 1152.5665 1190.599 5209.431   100
#  one_loop  492.945  500.7045  518.73060  504.9435  516.638  678.951   100
#    ifelse   42.269   45.7770   50.55519   48.4140   50.470  198.533   100
#vectorized   12.626   14.5520   16.21975   15.6380   17.663   27.525   100
Sign up to request clarification or add additional context in comments.

3 Comments

Fully vectorized, no ifelse: df.i3[is.na(df.i3)] = df.i3[row(df.i3)[is.na(df.i3)], 39]
(And for loop, no ifelse, replace the inner line with df.i1[is.na(df.i1[, j]), j] <- df.i1[is.na(df.i1[, j]), 39])
@Gregor Thanks, I have used the first comment, see the edit.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.