Mystery bug in sampling for loop in R

Question

I am trying to understand what is causing this bug in my R code and I feel like R is gaslighting me.

The sample() function seems to change depending on how I assign it?

Anyways, here is the MRE:

#Sampling Bug MRE
rm(list = ls())
library(tidyverse)
ages=c(paste0("CHILD",seq(1,10),"AGE"))
set.seed(26)
df=c()
for(i in 1:10){
  x=round(runif(1:100,min=1,max=20),0)
  df = as.data.frame(cbind(df,x))
}
names(df)=ages

set.seed(26)
df$`Sampled Child`=0
test_vector=c()
for(i in 1:nrow(df)){
  childs_age = unlist(c(as.numeric(df[i,ages])))
  slice=which(childs_age<=17)
  if(length(slice)>=1){
    df$`Sampled Child`[i]=sample(x=slice,size=1,replace = F)
    test_vector=append(test_vector,sample(x=slice,size=1,replace = F))
  }
  else{
    df$`Sampled Child`[i]="Ineligible"
    test_vector=append(test_vector,"Ineligible")
  }
}
df$test=test_vector
sum(df$`Sampled Child`==df$test)

I just need someone to explain why assigning the value with df$`Sampled Child`[i] gives a different number than appending it to a vector.

TIA!

I am trying to sample one child who is less than 17 years old. Once I know which children are less than 17, I pick one at random. If there are no children less than 17, the household is ineligible.

Ben Bolker · Accepted Answer · 2024-05-30 16:02:56Z

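You're getting different answers because you're calling sample() twice: each call draws a new random value, so the number stored in the data frame and the number appended to the vector come from separate draws.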

If your code instead looked like this:

  if(length(slice)>=1){
    cur_samp <- sample(x=slice,size=1,replace = FALSE)
    df$`Sampled Child`[i] <- cur_samp
    test_vector=append(test_vector,cur_samp)
  }

then the two results should be equal.
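As a minimal sketch of the difference (using a hypothetical `slice` vector rather than the question's data): every call to sample() advances the random number generator, so two consecutive calls are independent draws, whereas a stored draw can be reused as often as needed.

```r
slice <- c(2, 5, 7, 9)  # hypothetical eligible positions, for illustration only

set.seed(26)
first_call  <- sample(x = slice, size = 1)  # consumes one draw from the RNG stream
second_call <- sample(x = slice, size = 1)  # a new, independent draw; may or may not match

set.seed(26)
cur_samp <- sample(x = slice, size = 1)     # draw once ...
stored_a <- cur_samp                        # ... then reuse the stored value
stored_b <- cur_samp

identical(stored_a, stored_b)    # TRUE: both copies come from the same draw
identical(first_call, cur_samp)  # TRUE: same seed, same first draw
```

The second comparison also shows why resetting the seed alone does not help inside the loop: the seed fixes the whole sequence of draws, not the value returned by each individual call.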

For what it's worth, growing data frames and vectors by repeatedly appending to them (or inserting into positions beyond the end of the vector) is inefficient in R; it's the second circle of the R Inferno. It would be better to create a vector of the appropriate length (e.g. filled with NA values) first, then assign to appropriate elements as you go.
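A rough sketch of that preallocation pattern, assuming a small made-up matrix of ages (`ages_mat` and `sampled` are illustrative names, not from the question):

```r
# Hypothetical ages: one row per household, one column per child.
ages_mat <- matrix(c( 3, 19, 12,
                     18, 20, 19,
                      5, 16, 20), nrow = 3, byrow = TRUE)

# Allocate the full-length result first, filled with NA.
sampled <- rep(NA_integer_, nrow(ages_mat))

set.seed(26)
for (i in seq_len(nrow(ages_mat))) {
  slice <- which(ages_mat[i, ] <= 17)
  if (length(slice) >= 1) {
    # Assign into an existing slot instead of appending.
    # The index-based draw also behaves sensibly when slice has one element.
    sampled[i] <- slice[sample.int(length(slice), 1)]
  }
  # Rows with no eligible child simply keep their NA.
}
sampled
```

Because the vector already has its final length, nothing is copied or regrown inside the loop, and rows without an eligible child need no special branch.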

edited May 30, 2024 at 16:02

answered May 30, 2024 at 15:36

Ben Bolker


7 Comments

ssm1020 Over a year ago

Thank you for your quick response! I will have to read that article you linked. I am working with pretty small datasets (less than 100 at a time), but I do agree it's inefficient. I do have one follow-up question though: the slice doesn't seem to be working properly in my code, but it does work for the MRE even though I just copied it over. There are instances on my end where a child who is older than 17 is being selected. Is there anything you see that would make you believe this is happening? There are NA values in my version of the dataset, as not every household has 10 children.

Ben Bolker Over a year ago

I agree that it probably doesn't matter in your case, but it's good not to get in the habit.

Ben Bolker Over a year ago

You might want to use NA rather than "Ineligible" as your result when there are no children < 17, so that the whole column remains numeric (otherwise all the values will get coerced to character type)
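A small illustration of that coercion, using a made-up vector of sampled positions (the names below are hypothetical):

```r
sampled <- c(3, 7, 1)            # hypothetical sampled child positions
class(sampled)                   # "numeric"

with_label <- sampled
with_label[2] <- "Ineligible"    # a single string coerces the whole vector
class(with_label)                # "character"

with_na <- sampled
with_na[2] <- NA                 # NA keeps the vector numeric
class(with_na)                   # "numeric"
sum(!is.na(with_na))             # 2 households with a sampled child
```

With NA, ineligible households can still be counted or filtered with is.na(), and the remaining values stay usable as numbers.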

Ben Bolker Over a year ago

If this solved your problem you're encouraged to click the check-mark to accept it ...

Ben Bolker Over a year ago

Yes, this is a classic sample() pitfall: stackoverflow.com/questions/13990125/…
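Assuming the pitfall meant here is the behaviour documented in ?sample, where a single number x >= 1 is treated as the range 1:x, the sketch below shows how a household with exactly one eligible child could end up with a different child selected. The value of `slice` is made up; the `resample` helper follows the pattern given in the examples section of ?sample.

```r
# When x is a single number >= 1, sample(x, ...) draws from 1:x, not from x itself.
slice <- 8                        # hypothetical: only child 8 is eligible
set.seed(26)
draws <- replicate(200, sample(x = slice, size = 1))
range(draws)                      # typically includes values below 8

# Index-based helper: always samples from the elements of x.
resample <- function(x, ...) x[sample.int(length(x), ...)]
safe_draws <- replicate(200, resample(slice, size = 1))
all(safe_draws == 8)              # TRUE: the only eligible child is always chosen
```

This situation is rare in the MRE, where nearly every row has several eligible children, but it may be more common when NA values leave only one eligible child in a household.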
