I am trying to understand what is causing this bug in my R code and I feel like R is gaslighting me.

The sample() function seems to change depending on how I assign it?

Anyways, here is the MRE:

#Sampling Bug MRE
rm(list = ls())
library(tidyverse)
ages=c(paste0("CHILD",seq(1,10),"AGE"))
set.seed(26)
df=c()
for(i in 1:10){
  x=round(runif(100,min=1,max=20),0)  # runif() takes a count; 1:100 only worked because its length is 100
  df = as.data.frame(cbind(df,x))
}
names(df)=ages

set.seed(26)
df$`Sampled Child`=0
test_vector=c()
for(i in 1:nrow(df)){
  childs_age = unlist(c(as.numeric(df[i,ages])))
  slice=which(childs_age<=17)
  if(length(slice)>=1){
    df$`Sampled Child`[i]=sample(x=slice,size=1,replace = F)
    test_vector=append(test_vector,sample(x=slice,size=1,replace = F))
  }
  else{
    df$`Sampled Child`[i]="Ineligible"
    test_vector=append(test_vector,"Ineligible")
  }
}
df$test=test_vector
sum(df$`Sampled Child`==df$test)

I just need someone to explain why assigning the value with df$`Sampled Child`[i] gives a different number than appending it to a vector.

TIA!

I am trying to sample only a child who is less than 17 years old. Once I know which children are less than 17, I pick one at random. If there are no children less than 17, they are ineligible.

1 Answer

You're getting different answers because you're calling sample() twice.

If your code instead looked like this:

 if(length(slice)>=1){
    cur_samp <- sample(x=slice,size=1,replace = FALSE)
    df$`Sampled Child`[i] <- cur_samp
    test_vector=append(test_vector,cur_samp)
  }

then the two results should be equal.
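
To see why two calls disagree, here is a minimal sketch (the `slice` values are made up for illustration). Each call to sample() advances the random number stream, so the second call is an independent draw. Resetting the seed reproduces the first draw, which is the same effect as storing it once and reusing it.

```r
slice <- c(2, 5, 7, 9)   # hypothetical eligible positions

set.seed(26)
draw_a <- sample(x = slice, size = 1)
draw_b <- sample(x = slice, size = 1)  # a second, independent draw; need not equal draw_a

set.seed(26)
cur_samp <- sample(x = slice, size = 1)  # same seed, so this repeats the first draw
identical(cur_samp, draw_a)              # TRUE
```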

For what it's worth, growing data frames and vectors by repeatedly appending to them (or inserting into positions beyond the end of the vector) is inefficient in R; it's the second circle of the R Inferno. It would be better to create a vector of the appropriate length (e.g. filled with NA values) first, then assign to appropriate elements as you go.
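
A rough sketch of the preallocation idea, assuming a made-up 100 x 10 matrix of ages (`ages_mat` and `sampled` are hypothetical names, not from the question). The result vector is created once, filled with NA, and only the rows with an eligible child are overwritten.

```r
n <- 100
set.seed(26)
# Hypothetical data: 100 households, 10 age columns
ages_mat <- matrix(round(runif(n * 10, min = 1, max = 20)), nrow = n)

sampled <- rep(NA_integer_, n)   # preallocate; NA marks "no eligible child"
for (i in seq_len(n)) {
  eligible <- which(ages_mat[i, ] <= 17)
  if (length(eligible) >= 1) {
    # draw a position within eligible, then index, so a single
    # eligible child is returned as-is
    sampled[i] <- eligible[sample.int(length(eligible), 1)]
  }
}
```

Because the placeholder is NA rather than a string, `sampled` stays an integer vector.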

7 Comments

Thank you for your quick response! I will have to read that article you linked. I am working with pretty small datasets (less than 100 at a time), but I do agree it's inefficient. I do have one follow-up question though: the slice doesn't seem to be working properly in my code, but it does work for the MRE even though I just copied it over. There are instances on my end where a child who is older than 17 is being selected. Is there anything you see that would make you believe this is happening? There are NA values in my version of the dataset, as not every household has 10 children.
I agree that it probably doesn't matter in your case, but it's good not to get in the habit.
You might want to use NA rather than "Ineligible" as your result when there are no children < 17, so that the whole column remains numeric (otherwise all the values will get coerced to character type)
If this solved your problem you're encouraged to click the check-mark to accept it ...
Yes, this is a classic sample() pitfall: stackoverflow.com/questions/13990125/…
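
For readers who can't follow the link, here is a sketch of what is presumably the pitfall meant, as documented in ?sample: when `x` is a single number n >= 1, sample() draws from 1:n instead of returning n. That could plausibly explain an older child being picked when only one child is eligible. `safe_pick` is a hypothetical helper, not part of base R.

```r
set.seed(26)
slice <- 7   # hypothetical: only child 7 is eligible

# sample() treats a single number n >= 1 as the range 1:n
draws <- replicate(200, sample(x = slice, size = 1))
range(draws)   # values other than 7 show up

# Safer: sample a position in the vector, then index into it
safe_pick <- function(v) v[sample.int(length(v), 1)]
safe <- replicate(200, safe_pick(slice))
all(safe == 7)  # TRUE
```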