Transform dataframe of characters into a "more clear" dataframe with binary variables in R

Question

Starting from a dataframe in R like the following (df):

year_1 <- c('James','Mike','Jane', NA)
year_2 <- c('Evelyn', 'Jackson', 'James', 'Avery')
year_3 <- c('Harper', 'Avery', NA, NA)
df <- data.frame(year_1, year_2, year_3)

...I would like to convert it into something like df_1 below. (My original dataframe has hundreds of elements, so I can't do this manually.)

names <- c('James','Mike','Jane','Evelyn', 'Jackson', 'Avery', 'Harper')
year_1 <- c('YES','YES','YES', 'NO', 'NO', 'NO', 'NO')
year_2 <- c('YES','NO','NO', 'YES', 'YES', 'YES', 'NO')
year_3 <- c('NO','NO','NO', 'NO', 'NO', 'YES', 'YES')
df_1 <- data.frame(year_1, year_2, year_3)
rownames(df_1) <- names

I have tried to:

1) convert all elements of df into a string vector with unique elements
2) construct the structure of df_1 using the names from step 1)
3) fill df_1 with a loop (this is where I am stuck: I can't build a loop that does the trick)

Any idea?

Thanks!!

You can do something like as.data.frame.matrix(table(stack(df))).
– iroha, Commented Dec 17, 2020 at 20:24
"Error in stack.data.frame(df) : no vector columns were selected"
– vog, Commented Dec 17, 2020 at 20:26

ThomasIsCoding · Accepted Answer · 2020-12-17 20:36:16Z

3

A base R option using stack + table

> as.data.frame(ifelse(table(stack(df)) == 1, "YES", "NO"))
        year_1 year_2 year_3
Avery       NO    YES    YES
Evelyn      NO    YES     NO
Harper      NO     NO    YES
Jackson     NO    YES     NO
James      YES    YES     NO
Jane       YES     NO     NO
Mike       YES     NO     NO
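
To see what the one-liner is doing, it may help to run it in steps. The following is only a sketch on the question's sample data; the intermediate names long, counts and res are made up for illustration. It also swaps == 1 for > 0, on the assumption that a name listed twice in the same year should still count as "YES".

```r
# Sample data from the question. stringsAsFactors = FALSE keeps the columns
# as character vectors on every R version.
year_1 <- c("James", "Mike", "Jane", NA)
year_2 <- c("Evelyn", "Jackson", "James", "Avery")
year_3 <- c("Harper", "Avery", NA, NA)
df <- data.frame(year_1, year_2, year_3, stringsAsFactors = FALSE)

# Step 1: long format with two columns, values (the name) and ind (the year)
long <- stack(df)

# Step 2: names x years contingency table; NA names are dropped by default
counts <- table(long)

# Step 3: any positive count becomes "YES", zero becomes "NO"
res <- as.data.frame(ifelse(counts > 0, "YES", "NO"))
res
```

The rows come out in alphabetical order because table() sorts the names.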


2 Comments

A5C1D2H2I1M1N2O1R2T1 Over a year ago

+1 but I'd probably use lapply(df, as.character) given their comment under their question, and [] replacement instead of ifelse. Something like: x <- table(stack(lapply(df, as.character))) + 1; x[] <- c("NO", "YES")[x]; x.

ThomasIsCoding Over a year ago

@A5C1D2H2I1M1N2O1R2T1 Yes, it makes sense, and we can use type.convert(df,as.is = TRUE) if possible
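
A sketch of what these two comments suggest, assuming the asker's columns are factors (which would explain the stack() error quoted under the question, and is what data.frame() produced by default before R 4.0). The names err, chr, counts and x are illustrative only, and the index uses (counts > 0) + 1 instead of the comment's plain + 1 so that a count above 1 still maps to "YES".

```r
year_1 <- c("James", "Mike", "Jane", NA)
year_2 <- c("Evelyn", "Jackson", "James", "Avery")
year_3 <- c("Harper", "Avery", NA, NA)
# Force factor columns to mimic the situation assumed above
df <- data.frame(year_1, year_2, year_3, stringsAsFactors = TRUE)

# stack() on a data frame refuses factor columns; capture the message
err <- tryCatch(stack(df), error = function(e) conditionMessage(e))

# Convert every column to character first, then tabulate as before
chr <- lapply(df, as.character)
counts <- table(stack(chr))

# [] replacement keeps the dimensions and dimnames of the table
x <- counts
x[] <- c("NO", "YES")[(counts > 0) + 1]
x
```

type.convert(df, as.is = TRUE) could stand in for the lapply() call when the whole data frame should be converted.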

jay.sf · Accepted Answer · 2020-12-17 20:28:06Z

2

What about this?

sapply(df, function(x) sapply(na.omit(unique(unlist(df))), `%in%`, x))
#         year_1 year_2 year_3
# James     TRUE   TRUE  FALSE
# Mike      TRUE  FALSE  FALSE
# Jane      TRUE  FALSE  FALSE
# Evelyn   FALSE   TRUE  FALSE
# Jackson  FALSE   TRUE  FALSE
# Avery    FALSE   TRUE   TRUE
# Harper   FALSE  FALSE   TRUE
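
The result above is a logical matrix. If the "YES"/"NO" data frame from the question is needed, one possible follow-up is sketched below. The names people, present and yes_no are made up for this example, and the inner sapply is replaced by a vectorised %in%, which should give the same matrix.

```r
year_1 <- c("James", "Mike", "Jane", NA)
year_2 <- c("Evelyn", "Jackson", "James", "Avery")
year_3 <- c("Harper", "Avery", NA, NA)
df <- data.frame(year_1, year_2, year_3, stringsAsFactors = FALSE)

# Unique non-missing names, in order of first appearance (column by column)
people <- unique(unlist(df, use.names = FALSE))
people <- people[!is.na(people)]

# One logical column per year: is each person listed in that year?
present <- sapply(df, function(x) people %in% x)
rownames(present) <- people

# Map TRUE/FALSE to "YES"/"NO" and turn the matrix into a data frame
yes_no <- as.data.frame(ifelse(present, "YES", "NO"))
yes_no
```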


akrun · Accepted Answer · 2020-12-17 20:24:39Z

1

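Here is an option with the tidyverse: reshape the data into 'long' format with pivot_longer, keep the distinct rows, add a column of 'YES', and reshape back to 'wide' with pivot_wider, filling the missing combinations with 'NO'. Finally, column_to_rownames moves the names into the row names.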

library(dplyr)
library(tidyr)
library(tibble)
df %>%
  pivot_longer(cols = everything(), values_drop_na = TRUE) %>%
  distinct %>%
  mutate(new = 'YES') %>% 
  pivot_wider(names_from = name, values_from = new, values_fill = 'NO') %>%
  column_to_rownames("value")

Output:

#          year_1 year_2 year_3
#James      YES    YES     NO
#Evelyn      NO    YES     NO
#Harper      NO     NO    YES
#Mike       YES     NO     NO
#Jackson     NO    YES     NO
#Avery       NO    YES    YES
#Jane       YES     NO     NO


1 Comment

vog Over a year ago

Super answer. Very fast and clean!! Thank you so much

McKay Joseph Hall · Accepted Answer · 2020-12-17 22:39:44Z

To offer another option, first we can extract the unique names from df using a nested for loop. We test if the name is already in our list, and further test if we're looking at an NA.

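people <- c()
for (i in seq_len(ncol(df))) {      # each year column
  for (j in seq_len(nrow(df))) {    # each row within that column
    pers <- df[j, i]
    # keep a name only if it is not NA and has not been collected yet
    if (!is.na(pers) && !(pers %in% people)) {
      people <- c(people, toString(pers))
    }
  }
}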

From here, we can iterate a simple %in% check over each year and combine into a full dataframe. The above answers are probably more straightforward, but I've found code like this is useful if you need to make other small changes to the data as it passes through the script.

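for (i in seq_along(colnames(df))) {
  colname <- colnames(df)[i]
  peoplein <- people %in% df[, i]    # TRUE where the person appears in year i
  if (i == 1) {
    df1 <- cbind(people, peoplein)   # start the result with the names column
  } else {
    df1 <- cbind(df1, peoplein)      # append the next year
  }
  colnames(df1)[i + 1] <- colname    # label the column that was just added
}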

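The resulting df1 is shown below. Because cbind combines the character vector people with the logical vectors, the result is a character matrix rather than a data frame, so the TRUE/FALSE values are stored as strings.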

     people    year_1  year_2  year_3 
[1,] "James"   "TRUE"  "TRUE"  "FALSE"
[2,] "Mike"    "TRUE"  "FALSE" "FALSE"
[3,] "Jane"    "TRUE"  "FALSE" "FALSE"
[4,] "Evelyn"  "FALSE" "TRUE"  "FALSE"
[5,] "Jackson" "FALSE" "TRUE"  "FALSE"
[6,] "Avery"   "FALSE" "TRUE"  "TRUE" 
[7,] "Harper"  "FALSE" "FALSE" "TRUE"
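
If a data frame with "YES"/"NO" values and the names as row names is preferred (as in the question's df_1), the same loop idea can be adapted. This is only a sketch: it starts from an empty data frame that already has the row names, and the name result is made up for the example.

```r
year_1 <- c("James", "Mike", "Jane", NA)
year_2 <- c("Evelyn", "Jackson", "James", "Avery")
year_3 <- c("Harper", "Avery", NA, NA)
df <- data.frame(year_1, year_2, year_3, stringsAsFactors = FALSE)

# Unique non-missing names, in the same order the nested loop collects them
people <- unique(unlist(df, use.names = FALSE))
people <- people[!is.na(people)]

# Zero-column data frame with one row per person
result <- data.frame(row.names = people)

# Add one "YES"/"NO" column per year
for (colname in colnames(df)) {
  result[[colname]] <- ifelse(people %in% df[[colname]], "YES", "NO")
}
result
```

Keeping the loop leaves room for the per-column tweaks mentioned above, while avoiding the coercion to strings that cbind introduces.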
