I have a large data frame in R with over 200 mostly character variables that I would like to convert to factors. I have prepared all levels and labels in a separate data frame. For a given variable Var1, the corresponding levels and labels are stored in Var1_v and Var1_b; for example, for the variable Gender they are named Gender_v and Gender_b.

Here is an example of my data:

df <- data.frame (Gender = c("2","2","1","2"),
                  AgeG = c("3","1","4","2"))

fct <- data.frame (Gender_v  = c("1", "2"),
                  Gender_b = c("Male", "Female"),
                  AgeG_v = c("1","2","3","4"),
                  AgeG_b = c("<25","25-60","65-80",">80"))

df$Gender <- factor(df$Gender, levels = fct$Gender_v, labels = fct$Gender_b, exclude = NULL)
df$AgeG <- factor(df$AgeG, levels = fct$AgeG_v, labels = fct$AgeG_b, exclude = NULL)

Is there a way to automate the process, so that the factors (levels and labels) are applied to the corresponding variables without my having to do every single one individually? I think it can be done with a function, probably using pmap.

My goal is to minimize the effort needed for this process. Is there also a better way to prepare the labels and levels?

Help is much appreciated.

  • There is the option stringsAsFactors in the creation of data frames. This may be useful earlier in your data pipeline. The error in your example code is due to your Gender_v and AgeG_v being stored as character values instead of numerical values. Your current code works when Gender_v = c(1,2) i.e. no quotation marks. Commented Jan 20, 2022 at 21:49
  • @typewriter How exactly would stringsAsFactors help? I am not getting any error in my code, by the way. It is just inefficient when you have to run it for over 200 variables. Commented Jan 20, 2022 at 21:57

2 Answers

I solved it with a simple refactoring of your code, automating it through a loop. The more variables you have, the more time this saves. I believe the lookup fct[[paste0(names(df[i]),"_v")]] can be refactored into a small function to look even better.

> df <- data.frame (Gender = c("2","2","1","2"),
+                   AgeG = c("3","1","4","2"))
> 
> fct <- data.frame (Gender_v  = c("1", "2"),
+                    Gender_b = c("Male", "Female"),
+                    AgeG_v = c("1","2","3","4"),
+                    AgeG_b = c("<25","25-60","65-80",">80"))
> 
> for(i in 1:ncol(df)){
+   
+   le <- fct[[paste0(names(df[i]),"_v")]]
+   
+   la <- fct[[paste0(names(df[i]),"_b")]]
+   
+   df[,i] <- factor(df[,i],levels = le ,labels = la,exclude = NULL)
+   
+ }
> 
> df
  Gender  AgeG
1 Female 65-80
2 Female   <25
3   Male   >80
4 Female 25-60
>

Edit: Here is the loop with an if condition added, so that only columns whose names end in "_f" are converted:


> library(stringr)  # provides str_remove(), used in the loop below
> df <- data.frame (Gender_f = c("2","2","1","2"),
+                             AgeG_f = c("3","1","4","2"),
+                   AgeN = c(70,15,96,30))
> 
> fct <- data.frame (Gender_v  = c("1", "2"),
+                                   Gender_b = c("Male", "Female"),
+                                   AgeG_v = c("1","2","3","4"),
+                                  AgeG_b = c("<25","25-60","65-80",">80"))
> 
> for(i in 1:ncol(df)){
+ 
+   if(endsWith(names(df[i]),"_f")){
+     
+     name <- str_remove(names(df[i]),"_f$")  # strip only the trailing "_f"
+   
+     le <- fct[[paste0(name,"_v")]]
+    
+     la <- fct[[paste0(name,"_b")]]
+      
+     df[,i] <- factor(df[,i],levels = le ,labels = la,exclude = NULL)
+   
+   }
+      
+ }
> 
> df
  Gender_f AgeG_f AgeN
1   Female  65-80   70
2   Female    <25   15
3     Male    >80   96
4   Female  25-60   30
> 
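
The lookup mentioned at the top of this answer could be pulled into a small helper along these lines. This is only a sketch: the name `lookup` is hypothetical, and it assumes that a variable without `_v`/`_b` columns in `fct` should simply be left untouched (which is what the comments below ask for). It relies on `[[` returning `NULL` for a column name that does not exist in a data frame.

```r
df <- data.frame(Gender = c("2", "2", "1", "2"),
                 AgeG   = c("3", "1", "4", "2"),
                 AgeN   = c(70, 15, 96, 30))

fct <- data.frame(Gender_v = c("1", "2"),
                  Gender_b = c("Male", "Female"),
                  AgeG_v   = c("1", "2", "3", "4"),
                  AgeG_b   = c("<25", "25-60", "65-80", ">80"))

# Hypothetical helper: fetch the "<var><suffix>" column of fct,
# or NULL if fct has no such column.
lookup <- function(fct, var, suffix) fct[[paste0(var, suffix)]]

for (var in names(df)) {
  le <- lookup(fct, var, "_v")
  la <- lookup(fct, var, "_b")
  # Convert only variables that have both levels and labels defined.
  if (!is.null(le) && !is.null(la)) {
    df[[var]] <- factor(df[[var]], levels = le, labels = la, exclude = NULL)
  }
}

str(df)
```

With this guard no naming convention such as the `_f` suffix is needed; `AgeN` stays numeric because `fct` has no `AgeN_v` or `AgeN_b` column.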

6 Comments

Thanks. Your method works flawlessly with the example, but not with my data. The problem in your code is probably the assumption that there is a factor, a level and a label for each variable. But this is not true, so the values of the other variables are turned into missing values.
Yes, you are correct! Just asking: if the variable is a factor, will you always have 2 entries in the fct data frame? Because in that case, it can be solved with an if condition.
Yes, I guess. There should be 2 entries in the fct data frame for each variable. I was thinking about a method that looks at the name of a variable in the df data frame, then adds "_v" and "_b" to create the factor for this variable from the fct data frame. How would you add the if condition, by the way? I am more experienced in SAS than in R.
I was wondering, what would differ if I also added a label for an NA level?
I edited the answer to add the if condition!
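
On the question about a label for an NA level, here is a minimal sketch of how that might look with `factor()` directly. The vector `gender` and the label "Unknown" are made up for illustration. `NA` is listed among the levels, and `exclude = NULL` keeps it there so it can receive a label.

```r
gender <- c("2", NA, "1", "2")

gender_f <- factor(gender,
                   levels  = c("1", "2", NA),
                   labels  = c("Male", "Female", "Unknown"),
                   exclude = NULL)  # keep NA as a level so it can be labelled

table(gender_f)
```

With the default `exclude = NA`, the `NA` entry would be dropped from the levels, so the three labels would no longer match up with the two remaining levels.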

A data frame is not really an appropriate data structure for storing the factor level definitions: there’s no reason to expect all factors to have an equal number of levels. Rather, I’d just use a plain list, and store the level information more compactly as named vectors, along these lines:

df <- data.frame(
  Gender = c("2", "2", "1", "2"),
  AgeG = c("3", "1", "4", "2")
)

value_labels <- list(
  Gender = c("Male" = 1, "Female" = 2),
  AgeG = c("<25" = 1, "25-60" = 2, "65-80" = 3, ">80" = 4)
)

Then you can make a function that uses that data structure to make factors in a data frame:

make_factors <- function(data, value_labels) {
  for (var in names(value_labels)) {
    if (var %in% colnames(data)) {
      vl <- value_labels[[var]]
      data[[var]] <- factor(
        data[[var]],
        levels = unname(vl),
        labels = names(vl)
      )
    }
  }
  data
}

make_factors(df, value_labels)
#>   Gender  AgeG
#> 1 Female 65-80
#> 2 Female   <25
#> 3   Male   >80
#> 4 Female 25-60

1 Comment

Thanks Mikko. I have changed one thing in your code to make it easier. I have switched the positions of the levels and labels to make it for example 1 = 'Male' instead of 'Male' = 1, and changed the function accordingly levels = names(vl), labels = unname(vl).
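
For reference, the variant described in this comment might look roughly like the sketch below. It is reconstructed from the comment's description rather than taken from the commenter's actual code: the names of each vector are now the stored values, and the elements are the labels.

```r
df <- data.frame(
  Gender = c("2", "2", "1", "2"),
  AgeG = c("3", "1", "4", "2")
)

# Names are the stored values, elements are the labels.
value_labels <- list(
  Gender = c("1" = "Male", "2" = "Female"),
  AgeG = c("1" = "<25", "2" = "25-60", "3" = "65-80", "4" = ">80")
)

make_factors <- function(data, value_labels) {
  # Only touch variables that exist in both the list and the data.
  for (var in intersect(names(value_labels), colnames(data))) {
    vl <- value_labels[[var]]
    data[[var]] <- factor(
      data[[var]],
      levels = names(vl),
      labels = unname(vl)
    )
  }
  data
}

result <- make_factors(df, value_labels)
result
```

Because the names of a vector are always character strings, the levels here match character columns such as those in `df` directly.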
