R: Converting multiple binary columns into one factor variable whose factors are binary columns

Question

I was given a horrendous dataset that I am struggling to clean up: 272 (character) variables and 343 observations. It consists of many binary variables that could have been summarized as one variable with multiple levels. Instead of asking "are you self employed or employed?" with the options 1 "self employed", 2 "employed" and maybe 3 "none/other", the set has two variables, v1.selfemployed and v2.employed, each with the options 1 "yes" and 2 "no".

I now need to combine several binary variables into one. Since they are characters, I need to convert them into factors, which I did (see example).

### dataset
v1 <- as.character(c("yes", "yes", "no", "yes", "yes", "no", "yes","no", "no", NA ))
v2 <- as.character(c("no","no","no","no","no","yes","no","yes", "no", NA))
v3 <- as.character(c("no","no", "yes", "no","no","no","no","no", "yes", NA))

df <- data.frame(v1,v2,v3)
library(tidyverse)

## dataframe -> tibble
df.t <- as_tibble(df)

## convert into 1/0 factor
df.t %>%
  mutate_if(is.character, as.factor) %>% 
  mutate_at(vars(1:3), ~fct_recode(., "1" = "yes", 
                                          "0" = "no"))
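
As a side note, mutate_if() and mutate_at() are superseded by across() in dplyr 1.0.0 and later, and the pipeline above prints its result without storing it. A minimal sketch of the same recoding with across(), assuming the toy data from the question (the name df_recoded is only illustrative):

```r
library(dplyr)
library(forcats)

# Same toy data as in the question
df <- data.frame(
  v1 = c("yes", "yes", "no", "yes", "yes", "no", "yes", "no", "no", NA),
  v2 = c("no", "no", "no", "no", "no", "yes", "no", "yes", "no", NA),
  v3 = c("no", "no", "yes", "no", "no", "no", "no", "no", "yes", NA)
)

# across() selects the bundle of columns; factor() converts each column
# before fct_recode() renames the levels. Assign the result to keep it.
df_recoded <- df %>%
  mutate(across(v1:v3, ~ fct_recode(factor(.x), "1" = "yes", "0" = "no")))

levels(df_recoded$v1)
```

Missing values stay NA after the recoding.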

I took this route because I have many binary "bundles" that I need to be able to select via vars(). After converting all necessary bundles, I saved them in a new data.frame because I am unsure about using tibbles. My goal is to have a variable v.combined with the factor levels v1, v2 and v3.

This exact question was posted 8 years ago in this thread. I tried the approaches mentioned there, but they don't seem to work. They might be outdated? I end up with either more observations than before (which is interesting) or errors. In 8 years, something must have changed in R that makes the process easier.

Thank you everyone for your help!

Can you show your expected output? It is not clear from the code.
– akrun, Commented Mar 23, 2022 at 15:22

Stefano Barbi · Accepted Answer · 2022-03-23 16:55:07Z

2

I am guessing that you want to reverse a "one-hot encoding" of a variable. Here is a quick way to do it.

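# \(x) and |> require R >= 4.1
# for each row, collect the names of the columns equal to "yes"
apply(df, 1, \(x) names(which(x == "yes"))) |>
  # rows without any "yes" (including all-NA rows) become NA
  purrr::map_chr(~ifelse(length(.x) == 0, NA_character_, .x))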

#+  [1] "v1" "v1" "v3" "v1" "v1" "v2" "v1" "v2" "v3" NA
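
To get from that character vector to the v.combined factor asked for in the question, the result can be stored as a column with explicit levels. The following is only a sketch in base R under a few assumptions: bundle and first_yes are illustrative names, and a row with more than one "yes" keeps only the first matching column.

```r
# Same toy data as in the question
df <- data.frame(
  v1 = c("yes", "yes", "no", "yes", "yes", "no", "yes", "no", "no", NA),
  v2 = c("no", "no", "no", "no", "no", "yes", "no", "yes", "no", NA),
  v3 = c("no", "no", "yes", "no", "no", "no", "no", "no", "yes", NA)
)

# Hypothetical name for one group of binary columns
bundle <- c("v1", "v2", "v3")

# For each row, take the name of the first column equal to "yes",
# or NA if no column matches (which() ignores NA comparisons)
first_yes <- vapply(
  seq_len(nrow(df)),
  function(i) {
    hits <- bundle[which(df[i, bundle] == "yes")]
    if (length(hits) == 0) NA_character_ else hits[1]
  },
  character(1)
)

# Store the result as a factor whose levels are the column names
df$v.combined <- factor(first_yes, levels = bundle)
```

Because bundle is just a character vector of column names, the same steps could be repeated for each of the other bundles in the real dataset.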

A tidyverse approach would be:

df |>
  mutate(ID = row_number()) |>
  pivot_longer(cols = c(v1,v2,v3), names_to = "var") |>
  filter(value == "yes")

##>      ID var   value
##>   <int> <chr> <chr>
##> 1     1 v1    yes  
##> 2     2 v1    yes  
##> 3     3 v3    yes  
##> 4     4 v1    yes  
##> 5     5 v1    yes  
##> 6     6 v2    yes  
##> 7     7 v1    yes  
##> 8     8 v2    yes  
##> 9     9 v3    yes
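
Note that filter() drops the row without any "yes", so the result above has 9 rows rather than 10. One possible way to keep every observation is to join the filtered result back onto the original data. This is a sketch, not the only option; combined and df_out are illustrative names. A row with more than one "yes" would appear more than once after the join, which may be one way to end up with more observations than before.

```r
library(dplyr)
library(tidyr)

# Same toy data as in the question
df <- data.frame(
  v1 = c("yes", "yes", "no", "yes", "yes", "no", "yes", "no", "no", NA),
  v2 = c("no", "no", "no", "no", "no", "yes", "no", "yes", "no", NA),
  v3 = c("no", "no", "yes", "no", "no", "no", "no", "no", "yes", NA)
)

# Long format: one row per (ID, column) pair, keeping only the "yes" cells
combined <- df |>
  mutate(ID = row_number()) |>
  pivot_longer(cols = c(v1, v2, v3), names_to = "var") |>
  filter(value == "yes") |>
  select(ID, v.combined = var)

# Join back so that rows without any "yes" are kept with NA
df_out <- df |>
  mutate(ID = row_number()) |>
  left_join(combined, by = "ID") |>
  mutate(v.combined = factor(v.combined, levels = c("v1", "v2", "v3"))) |>
  select(-ID)
```

Comparing nrow(df_out) with nrow(df) is a quick check that no row was duplicated by the join.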

edited Mar 23, 2022 at 16:55

answered Mar 23, 2022 at 16:41


1 Comment

Olga, over a year ago

Thank you! Both ways worked absolutely perfectly - I am amazed.
