Here are four ways to do what the inner instruction does: replace each NA with the value from column 39 of the same row.
First, an example dataset.
set.seed(5345) # Make the results reproducible
df.i <- matrix(1:400, ncol = 40)
is.na(df.i) <- sample(400, 50)
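In case the replacement form is.na(x) <- idx is unfamiliar: it sets the elements at the given positions to NA, and for a matrix the positions are linear indices counted down the columns. A minimal sketch on toy data (the names `x` and `m` are only for illustration and are not part of the question):

```r
# Vector case: mark positions 2 and 5 as missing
x <- 1:10
is.na(x) <- c(2, 5)      # same effect as x[c(2, 5)] <- NA

# Matrix case: the index is linear and counts down the columns
m <- matrix(1:6, nrow = 2)
is.na(m) <- 3            # third cell in column-major order, i.e. m[1, 2]

which(is.na(x))
which(is.na(m), arr.ind = TRUE)
```

So sample(400, 50) in the dataset above picks 50 of the 400 cells at random and marks them as missing.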
Now, following the comment by @Dave2e: keep just one for loop and vectorize the innermost one.
df.i2 <- df.i3 <- df.i1 <- df.i # Work with copies
for (j in 1:ncol(df.i1)) {
  df.i1[, j] <- ifelse(is.na(df.i1[, j]), df.i1[, 39], df.i1[, j])
}
Then, fully vectorized, with no loops at all.
df.i2 <- ifelse(is.na(df.i), df.i[, 39], df.i)
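This works because ifelse() recycles the shorter `yes` vector down the columns of the matrix: a matrix is stored column by column, so a vector with one element per row lines up with the rows of every column. A minimal sketch on a small toy matrix (the names `m`, `fill` and `res` are made up for illustration, and the last column plays the role of column 39):

```r
# Toy 3 x 4 matrix; the last column is the fill column (hypothetical data)
m <- matrix(1:12, nrow = 3)
m[2, 1] <- NA
m[3, 2] <- NA

fill <- m[, 4]            # one value per row: 10 11 12

# ifelse() recycles `fill` column by column, so cell [i, j] is paired with fill[i]
res <- ifelse(is.na(m), fill, m)

res[2, 1]   # 11, taken from row 2 of the fill column
res[3, 2]   # 12, taken from row 3 of the fill column
```

Note that this alignment relies on the fill vector having exactly nrow(m) elements.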
Another vectorized solution, by @Gregor in the comments, is much better, since ifelse is known to be relatively slow.
df.i3[is.na(df.i3)] <- df.i3[row(df.i3)[is.na(df.i3)], 39]
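To see why this one-liner works: row(x) returns a matrix of the same shape as x holding each cell's row number, so subsetting it with is.na(x) gives the row index of every missing cell, in the same column-major order in which the logical assignment fills them. A small sketch with made-up data (the names `m`, `miss` and `rows` are hypothetical, and the last column stands in for column 39):

```r
m <- matrix(1:12, nrow = 3)
m[2, 1] <- NA
m[3, 2] <- NA
m[1, 3] <- NA

miss <- is.na(m)          # logical matrix marking the missing cells
rows <- row(m)[miss]      # row number of each missing cell: 2 3 1

# Replace each NA with the value from the same row of the last column
m[miss] <- m[rows, 4]

m[, 1:3]
```

Only the missing cells are touched here, which is presumably part of why this version avoids the overhead of ifelse().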
And your solution, as posted in the question.
for (j in 1:ncol(df.i)) {
  for (i in 1:nrow(df.i)) {
    df.i[i,j] <- ifelse(is.na(df.i[i,j]), df.i[i,39], df.i[i,j])
  }
}
Compare the results.
identical(df.i, df.i1)
#[1] TRUE
identical(df.i, df.i2)
#[1] TRUE
identical(df.i, df.i3)
#[1] TRUE
Benchmarks.
After the comment by @Gregor, I decided to benchmark the four solutions. As expected, each optimization gives a significant speed-up, and his fully vectorized solution is the fastest.
# Two nested loops, as in the question
f <- function(df.i){
  for (j in 1:ncol(df.i)) {
    for (i in 1:nrow(df.i)) {
      df.i[i,j] <- ifelse(is.na(df.i[i,j]), df.i[i,39], df.i[i,j])
    }
  }
  df.i
}
# One loop over the columns, inner loop vectorized
f1 <- function(df.i1){
  for (j in 1:ncol(df.i1)) {
    df.i1[,j] <- ifelse(is.na(df.i1[, j]), df.i1[, 39], df.i1[, j])
  }
  df.i1
}
# No loops, ifelse on the whole matrix
f2 <- function(df.i2){
  df.i2 <- ifelse(is.na(df.i2), df.i2[, 39], df.i2)
  df.i2
}
# No loops, logical indexing
f3 <- function(df.i3){
  df.i3[is.na(df.i3)] <- df.i3[row(df.i3)[is.na(df.i3)], 39]
  df.i3
}
microbenchmark::microbenchmark(
  two_loops = f(df.i),
  one_loop = f1(df.i1),
  ifelse = f2(df.i2),
  vectorized = f3(df.i3)
)
#Unit: microseconds
#       expr      min        lq       mean    median       uq      max neval
#  two_loops 1125.017 1143.4995 1226.93089 1152.5665 1190.599 5209.431   100
#   one_loop  492.945  500.7045  518.73060  504.9435  516.638  678.951   100
#     ifelse   42.269   45.7770   50.55519   48.4140   50.470  198.533   100
# vectorized   12.626   14.5520   16.21975   15.6380   17.663   27.525   100
ifelse is a vectorized function, so your inner loop can be replaced with:
df.i[,j] <- ifelse(is.na(df.i[,j]), df.i[,39], df.i[,j])
This can now be used in your apply function.
A note on using apply() with data.frames: apply() coerces the data.frame to a matrix, where all columns are of the same data type. This does not seem to be an issue in your particular case, but in general it is safer to use lapply().
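The coercion done by apply() is easy to see on a data.frame with mixed column types. A minimal sketch, assuming a hypothetical data.frame `d` with a `fill` column standing in for column 39 (none of these names come from the question):

```r
# Hypothetical mixed-type data: two numeric columns and one character column
d <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA), fill = c(10, 20, 30),
                stringsAsFactors = FALSE)

# apply() first converts the data.frame with as.matrix(), so every column
# is seen as character
cls_apply <- apply(d, 2, class)

# sapply()/lapply() work column by column and keep each column's own type
cls_lapply <- sapply(d, class)

# Fill the NAs of the numeric columns only, leaving the other types untouched
num <- vapply(d, is.numeric, logical(1))
d[num] <- lapply(d[num], function(col) ifelse(is.na(col), d$fill, col))

d
```

With lapply() the character column `y` keeps its NA and its type, while the numeric column `x` is filled from `fill`.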