
I was given a horrendous dataset that I am struggling to clean up: 272 (character) variables and 343 observations. It consists of many binary variables that could have been summarized into one variable with multiple levels. So instead of asking "are you self-employed or employed?" and offering the options 1 "self-employed", 2 "employed" and maybe 3 "none/other", the set has two variables, v1.selfemployed and v2.employed, each with the options 1 "yes" and 2 "no".

I now need to combine several binary variables into one. Since they are characters, I need to convert them into factors, which I did (see example).

### dataset
v1 <- as.character(c("yes", "yes", "no", "yes", "yes", "no", "yes","no", "no", NA ))
v2 <- as.character(c("no","no","no","no","no","yes","no","yes", "no", NA))
v3 <- as.character(c("no","no", "yes", "no","no","no","no","no", "yes", NA))

df <- data.frame(v1,v2,v3)
library(tidyverse)

## dataframe -> tibble
df.t <- as_tibble(df)

## convert into 1/0 factor
df.t %>%
  mutate_if(is.character, as.factor) %>% 
  mutate_at(vars(1:3), ~fct_recode(., "1" = "yes", 
                                          "0" = "no"))

I took this route because I have many binary "bundles" that I need to be able to select via vars(). After converting all necessary bundles, I saved them in a new data.frame because I am unsure about using tibbles. My goal is to have a single variable v.combined with the factor levels v1, v2 and v3.

This exact question was posted 8 years ago in this thread. I tried the approaches mentioned there, but they don't seem to work. Might they be outdated? I end up with either more observations than before (which is interesting) or errors. Surely something has changed in R over those 8 years that makes the process easier.

Thank you everyone for your help!

  • Can you show your expected output? It is not clear from the code. Commented Mar 23, 2022 at 15:22

1 Answer


I am guessing that you want to revert a "one-hot encoding" of a variable. Here is a quick way to do it.

apply(df, 1, \(x) names(which(x == "yes"))) |>
  purrr::map_chr(~ifelse(length(.x) == 0, NA_character_, .x))

#+  [1] "v1" "v1" "v3" "v1" "v1" "v2" "v1" "v2" "v3" NA  
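Since the question mentions many such bundles and asks for a factor column, the same idea could be wrapped in a small base R helper. This is only a sketch: `combine_bundle` is a hypothetical name, and it assumes each row should have exactly one "yes" within a bundle. Rows with no "yes" or with more than one are set to NA here, which may or may not be what you want.

```r
# Example data from the question
df <- data.frame(
  v1 = c("yes", "yes", "no", "yes", "yes", "no", "yes", "no", "no", NA),
  v2 = c("no", "no", "no", "no", "no", "yes", "no", "yes", "no", NA),
  v3 = c("no", "no", "yes", "no", "no", "no", "no", "no", "yes", NA)
)

# Hypothetical helper: collapse one bundle of yes/no columns into a factor
combine_bundle <- function(data, cols) {
  is_yes <- as.matrix(data[cols]) == "yes"   # logical matrix, NA where missing
  is_yes[is.na(is_yes)] <- FALSE             # treat missing answers as "not yes"
  n_yes <- rowSums(is_yes)                   # number of "yes" answers per row
  idx <- max.col(is_yes * 1, ties.method = "first")  # column of the first "yes"
  # Assumption: only rows with exactly one "yes" get a level, all others are NA
  out <- ifelse(n_yes == 1, cols[idx], NA_character_)
  factor(out, levels = cols)
}

df$v.combined <- combine_bundle(df, c("v1", "v2", "v3"))
```

Calling the helper once per bundle, each time with a different vector of column names, keeps the number of rows unchanged.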

A tidyverse approach would be:

df |>
  mutate(ID = row_number()) |>
  pivot_longer(cols = c(v1,v2,v3), names_to = "var") |>
  filter(value == "yes")

##>      ID var   value
##>   <int> <chr> <chr>
##> 1     1 v1    yes  
##> 2     2 v1    yes  
##> 3     3 v3    yes  
##> 4     4 v1    yes  
##> 5     5 v1    yes  
##> 6     6 v2    yes  
##> 7     7 v1    yes  
##> 8     8 v2    yes  
##> 9     9 v3    yes  
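Note that the filter drops row 10, where every value is NA. If the result should sit next to the original columns as a factor called v.combined, one possible sketch is to join it back by ID. The names `combined` and `result` are only illustrative, and the sketch assumes at most one "yes" per row; a row with several "yes" values would be duplicated by the join, which could explain ending up with more observations than before.

```r
library(dplyr)
library(tidyr)

# Example data from the question
df <- data.frame(
  v1 = c("yes", "yes", "no", "yes", "yes", "no", "yes", "no", "no", NA),
  v2 = c("no", "no", "no", "no", "no", "yes", "no", "yes", "no", NA),
  v3 = c("no", "no", "yes", "no", "no", "no", "no", "no", "yes", NA)
)

# Long format, keeping only the column that holds "yes" for each row
combined <- df |>
  mutate(ID = row_number()) |>
  pivot_longer(cols = c(v1, v2, v3), names_to = "var") |>
  filter(value == "yes") |>
  select(ID, v.combined = var)

# Join back so that rows without any "yes" are kept, with NA in v.combined
result <- df |>
  mutate(ID = row_number()) |>
  left_join(combined, by = "ID") |>
  mutate(v.combined = factor(v.combined, levels = c("v1", "v2", "v3")))
```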


1 Comment

Thank you! Both ways worked absolutely perfectly - I am amazed.
