
I used the list to create 4 datasets. Now I want to list all potential ID variables in each dataset. My criteria are: 1) the variable has over 80% unique observations; 2) the variable has no more than 30% missing values.

To get those summary statistics, I first use skim() from the skimr package in R to get a tibble containing all the information, then I use filter() to keep the variables that meet the two criteria above. Here is my code:

 dfa<- dflist[[1]]%>%
      mutate_if(is.numeric,as.character)%>%
      skim()%>%
      as_tibble()%>%
      filter(character.n_unique >=nrow(dflist[[1]])*0.01)%>%
      filter(n_missing<=nrow(dflist[[1]])*0.30)

This code works fine and returns the expected variables for dataset 1. However, I have 4 datasets of different sizes, so I would like to put it into a loop. Here is my attempt: first, I create a list dfid to hold the new results, since I do not want dflist to be modified. Then I changed the 1 in dflist[[1]] in the previous code to i. But this code does not work; R reports "Error in filter(., dflist[[i]][, character.n_unique] >= nrow(dflist[[1]]) * : Caused by error in [.data.frame: ! undefined columns selected".

Here is my code:

dfid<-list()
for (i in 1:4){
    dfid[[i]]<-dflist[[i]]%>%
            mutate_if(is.numeric,as.character)%>%
            skim()%>%
            as_tibble()%>%
            filter(dflist[[i]][,character.n_unique] >=nrow(dflist[[i]])*0.01)%>%
            filter(dflist[[i]][,n_missing]<=nrow(dflist[[i]])*0.30)
}

So my questions are:

  1. How can I fix this error?
  2. Once dfid[[i]] holds the desired variables from the 4 datasets, what code should I add to the loop to combine them (4 lists), deduplicate the variable names, and finally get a vector of variable names from the combined list or dataset?

Thanks a lot for your help in advance~~!

1 Answer

Column names must be quoted when used inside [, unless the name is stored in an object. It may be easier to loop with map/lapply:

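library(purrr)
library(dplyr)
library(skimr)
# nrow(.x) is the row count of the original dataset; n() inside filter()
# would instead count the rows of the skim() summary (one per variable)
dfid <- map(dflist, ~ .x %>%
      mutate(across(where(is.numeric), as.character)) %>%
      skim() %>%
      as_tibble() %>%
      filter(character.n_unique >= nrow(.x) * 0.01) %>%
      filter(n_missing <= nrow(.x) * 0.30))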

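We don't need [ inside the pipe chain, because filter() looks up the columns of the skim() output directly. The same thing as a for loop:

dfid <- vector('list', length(dflist))
for (i in seq_along(dflist)) {
    tmp <- dflist[[i]]
    # nrow(tmp) is the row count of the original dataset, not of the summary
    dfid[[i]] <- tmp %>%
            mutate_if(is.numeric, as.character) %>%
            skim() %>%
            as_tibble() %>%
            filter(character.n_unique >= nrow(tmp) * 0.01) %>%
            filter(n_missing <= nrow(tmp) * 0.30)
}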
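
If it helps to sanity-check the result without skimr, here is a minimal base R sketch of the two criteria as stated in the question (at least 80% unique values, at most 30% missing). The toy dflist, the helper name find_id_vars, and its arguments are hypothetical, made up for illustration; counting unique values among the non-missing entries only is also an assumption.

```r
# Hypothetical toy data standing in for dflist from the question
dflist <- list(
  data.frame(id = 1:10, grp = rep(c("a", "b"), 5), stringsAsFactors = FALSE),
  data.frame(key = c(letters[1:9], NA), flag = rep(1, 10), stringsAsFactors = FALSE)
)

# Hypothetical helper: names of the columns that look like ID variables
find_id_vars <- function(df, min_unique = 0.80, max_missing = 0.30) {
  n_rows <- nrow(df)
  # Share of distinct non-missing values in each column
  prop_unique <- vapply(
    df,
    function(col) length(unique(col[!is.na(col)])) / n_rows,
    numeric(1)
  )
  # Share of missing values in each column
  prop_missing <- vapply(df, function(col) mean(is.na(col)), numeric(1))
  names(df)[prop_unique >= min_unique & prop_missing <= max_missing]
}

# One character vector of candidate ID columns per dataset
dfid_names <- lapply(dflist, find_id_vars)
```

Because the thresholds are proportions of each dataset's own row count, the same call works for datasets of different sizes.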

5 Comments

Thanks a lot for your answer~~~ Now that I have the dfid list, I use namelist<-unique(c(dfid[[1]]$skim_variable,dfid[[2]]$skim_variable,dfid[[3]]$skim_variable,dfid[[4]]$skim_variable)) to create the unique ID name list. Could you teach me how to simplify this code? Right now I have to type out dfid[[1]], dfid[[2]]... Thanks so much~~!
@Rstudyer if you want to extract the column, unique(sapply(dfid, "[[", "skim_variable")) should do it
Thanks so much~~! Just one more question about that sapply code: when I use it, it returns separate lists with the unique variable names in each list, instead of one vector containing all unique variable names. Is there a way to combine all 4 dfid lists and then apply unique to the whole thing as one vector? Thanks a lot~~
@Rstudyer you may do unique(unlist(sapply(... Probably because of the length difference, it still returned a list
it works~~~!!! Thanks so much~~!
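
For reference, a minimal sketch of the combine-and-deduplicate step discussed in the comments, using base R only. The toy dfid below is a hypothetical stand-in for the real list of skim results; only the skim_variable column matters here. It uses lapply rather than sapply because lapply always returns a list, whatever the lengths of the pieces, so unlist() reliably gives one flat vector.

```r
# Hypothetical stand-in for dfid: each element has a skim_variable column
dfid <- list(
  data.frame(skim_variable = c("id", "email"), stringsAsFactors = FALSE),
  data.frame(skim_variable = "id", stringsAsFactors = FALSE),
  data.frame(skim_variable = c("key", "email", "id"), stringsAsFactors = FALSE)
)

# Pull skim_variable out of every element, flatten, then deduplicate
namelist <- unique(unlist(lapply(dfid, `[[`, "skim_variable")))
```

This works for any number of datasets, so there is no need to write out dfid[[1]], dfid[[2]], and so on.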
