Converting Categorical Columns into Multiple Binary Columns in R

Question

I am trying to convert a column that has categorical data ('A', 'B', or 'C') to 3 columns where 1,0,0 would be 'A'; 0,1,0 would represent 'B', etc.

I found this code online:

flags = data.frame(Reduce(cbind, 
     lapply(levels(d$purpose), function(x){(d$purpose == x)*1})
))
names(flags) = levels(d$purpose)
d = cbind(d, flags)

# Include the new columns as input variables
levelnames = paste(names(flags), collapse = " + ")
neuralnet(paste("output ~ ", levelnames), d)

Converting categorical variables in R for ANN (neuralnet)

But I'm very new to R. Can anyone break down what this complicated looking code is doing?

edit:

Implementing @nongkrong's recommendations I'm running into a problem:

CSV:

X1,X2,X3
A,D,Q
B,E,R
C,F,S
B,G,T
C,H,U
A,D,Q

R:

newData <- read.csv("new.csv")
newerData <- model.matrix(~ X1 + X2 + X3 -1, data=newData)
newerData

R Output:

  X1A X1B X1C X2E X2F X2G X2H X3R X3S X3T X3U
1   1   0   0   0   0   0   0   0   0   0   0
2   0   1   0   1   0   0   0   1   0   0   0
3   0   0   1   0   1   0   0   0   1   0   0
4   0   1   0   0   0   1   0   0   0   1   0
5   0   0   1   0   0   0   1   0   0   0   1
6   1   0   0   0   0   0   0   0   0   0   0

It works great with 1 column, but is missing X2D and X3Q. Any ideas why?

i don't think this code is necessary, you can use simply model.matrix(~ purpose -1, data=d), but all it is doing is expanding the factor variable into a bunch of dummy columns. Each dummy column corresponds to a level of the original factor, and is 1 where that factor was present in the original
– Rorschach, Commented Jul 30, 2015 at 21:37
Awesome, thanks! I got this to work great with 1 column but am getting odd results with multiple columns (see my edit of op)
– Adam12344, Commented Jul 30, 2015 at 22:08
I guess it's because of -1, try removing that and see what you get (though I would have expected X1A to be dropped as well...)
– MichaelChirico, Commented Jul 30, 2015 at 22:35
It removed an Intercept column. I got around it for now by doing 1 column at a time and using cbind to combine them
– Adam12344, Commented Jul 30, 2015 at 22:57
the output is like that because these dummy columns are contrasting different combinations of your factors against a base case, the intercept (which was removed from the model by using -1). I'm not sure how to include those columns as dummies as well, sadly
– Rorschach, Commented Jul 30, 2015 at 22:57
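
A note on the missing X2D and X3Q columns: by default model.matrix applies treatment contrasts, so with the intercept removed only the first factor keeps all of its levels and every later factor drops its first level. One way to get an indicator for every level is to pass full (identity) contrasts through contrasts.arg. The sketch below uses a hand-typed copy of the CSV from the question; stringsAsFactors = TRUE is an assumption of this sketch, needed because contrasts() accepts only factors.

```r
# Hand-typed copy of new.csv from the question
newData <- data.frame(
  X1 = c("A", "B", "C", "B", "C", "A"),
  X2 = c("D", "E", "F", "G", "H", "D"),
  X3 = c("Q", "R", "S", "T", "U", "Q"),
  stringsAsFactors = TRUE
)

# Default coding: X2 and X3 each lose their first level (D and Q)
default_mm <- model.matrix(~ X1 + X2 + X3 - 1, data = newData)

# Identity ("no contrast") coding for every factor keeps all levels
full_mm <- model.matrix(
  ~ X1 + X2 + X3 - 1,
  data = newData,
  contrasts.arg = lapply(newData, contrasts, contrasts = FALSE)
)
colnames(full_mm)
```

The full set of columns is linearly dependent (each factor's indicators sum to 1 in every row), which is why regression-style functions drop a level by default.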

MichaelChirico · Accepted Answer · 2018-04-10 06:23:36Z

2

@nongkrong is right--read ?formula and you'll see that most functions that accept formulas as input (e.g. lm, glm, etc.) will automatically convert categorical variables (stored as factors or characters) to dummies; you can force this on non-factor numeric variables by specifying as.factor(var) in your formula.
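
As a small illustration of that automatic conversion, here is a sketch on made-up data (the column names y, g and k are hypothetical). A factor on the right-hand side is expanded into dummies, a numeric column gets a single slope, and as.factor() forces the numeric column to be treated as categorical.

```r
# Toy data: g is a factor, k is numeric but really encodes three groups
d <- data.frame(
  y = c(1.0, 2.1, 2.9, 1.2, 2.0, 3.1),
  g = factor(c("A", "B", "C", "A", "B", "C")),
  k = c(1, 2, 3, 1, 2, 3)
)

fit_factor  <- lm(y ~ g, data = d)             # g expanded into dummies
fit_numeric <- lm(y ~ k, data = d)             # k treated as one numeric slope
fit_forced  <- lm(y ~ as.factor(k), data = d)  # k expanded into dummies

names(coef(fit_factor))
names(coef(fit_forced))
```

In both expanded fits, one level is absorbed into the intercept, so three levels produce an intercept plus two dummy coefficients.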

That said, I've encountered situations where it's convenient to have created these indicators by hand anyway--e.g., a data set with an ethnicity variable where <1% of the data fit in one or several of the ethnicity codes. There are other ways to deal with this (simply delete the minority-minority observations, e.g.), but I find that varies by situation.

So, I've annotated the code for you:

flags = data.frame(Reduce(cbind, 
     lapply(levels(d$purpose), function(x){(d$purpose == x)*1})
))

Lots going on in this first line, so let's go bit-by-bit:

d$purpose==x checks each entry of d$purpose for equality to x; the result will be TRUE or FALSE (or NA if there are missing values). Multiplying by 1 (*1) coerces the logical output to numeric (so TRUE becomes 1 and FALSE becomes 0); use as.integer() instead if you want a true integer type.
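
A quick check of this step on a made-up vector (the values "car" and "house" are only for illustration):

```r
purpose <- factor(c("car", "house", NA, "car"))

is_car <- purpose == "car"   # logical vector; NA stays NA
flag   <- is_car * 1         # arithmetic coerces TRUE/FALSE to 1/0
flag
class(flag)                  # "numeric"
class(as.integer(is_car))    # "integer"
```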

lapply applies the function in its second argument to each element of its first argument--so for each element of levels(d$purpose) (i.e., each level of d$purpose), we output a vector of 0s and 1s, where the 1s correspond to the elements of d$purpose matching the given level. The output of lapply is a list (hence l in front of apply), with one list element corresponding to each of the levels of d$purpose.

We want to get this into our data.frame, so a list isn't very useful; Reduce is what we use to back out the information from the list to a data.frame form. Reduce(cbind,LIST) is the same as cbind(LIST[[1]],LIST[[2]],LIST[[3]],...)--convenient shorthand, especially when we don't know the length of LIST.
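
The lapply and Reduce steps can be tried on a small made-up factor. This sketch also includes do.call(cbind, ...), a common alternative that is not part of the original code. The three results hold the same values; their column names may differ, which does not matter here because the names are overwritten in the next step.

```r
purpose <- factor(c("car", "house", "car", "boat"))

# One 0/1 vector per level, collected in a list
cols <- lapply(levels(purpose), function(x) (purpose == x) * 1)
length(cols)

via_reduce <- Reduce(cbind, cols)                       # pairwise cbind
via_cbind  <- cbind(cols[[1]], cols[[2]], cols[[3]])    # written out by hand
via_docall <- do.call(cbind, cols)                      # alternative idiom

all(via_reduce == via_cbind)
all(via_reduce == via_docall)
```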

Wrapping this in data.frame converts the resulting matrix into an object of class data.frame.

#This line simply puts column names on each of the indicator variables
#  Note that you can replace the RHS of this line with whatever 
#  naming convention you want for the levels--a common approach might
#  be to specify paste0(levels(d$purpose),"_flag"), e.g.
names(flags) = levels(d$purpose)
#this line adds all the indicator variables to the original 
#  data.frame
d = cbind(d, flags)
#this creates a string of the form "level1 + level2 + ... + leveln"
levelnames = paste(names(flags), collapse = " + ")
#finally we create a formula of the form output ~ d1 + d2 + ... + dn
#  where each of the d* is a dummy for a level of the categorical variable
neuralnet(paste("output ~ ", levelnames), d)
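
To see the whole annotated pipeline run end to end, here is a sketch on a tiny made-up data set (the output values and purpose levels are invented). It uses the "_flag" naming suggested in the comments and wraps the string in as.formula, which is an addition of this sketch; the neuralnet call is left commented out because it needs the neuralnet package.

```r
d <- data.frame(
  output  = c(1, 0, 1, 0),
  purpose = factor(c("car", "house", "car", "boat"))
)

# One 0/1 column per level of purpose
flags <- data.frame(Reduce(cbind,
  lapply(levels(d$purpose), function(x) (d$purpose == x) * 1)
))
names(flags) <- paste0(levels(d$purpose), "_flag")
d <- cbind(d, flags)

# Build "output ~ boat_flag + car_flag + house_flag" and convert it
levelnames <- paste(names(flags), collapse = " + ")
f <- as.formula(paste("output ~", levelnames))
f

# neuralnet::neuralnet(f, data = d)  # requires the neuralnet package
```

Note that building a formula from level names only works when the names are syntactically valid R names; levels containing spaces or starting with digits would need renaming first.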

Also note that something like this could have been done much more simply with the data.table package:

library(data.table)
setDT(d)
l = levels(d$purpose)
d[ , (l) := lapply(l, function(x) as.integer(purpose == x))]
#neuralnet needs the data passed explicitly, so call it outside d[ ]
neuralnet(as.formula(paste0("output ~ ", paste0(l, collapse = " + "))), data = d)

edited Apr 10, 2018 at 6:23

answered Jul 30, 2015 at 21:56

MichaelChirico


4 Comments

dondapati Over a year ago

@Michael How would I write this if I have multiple columns, each with several factor levels? Can I repeat the code for each column?

dondapati Over a year ago

I wrote the following code, but it's not working. Dummy data:

df <- structure(list(x1 = structure(1:3, .Label = c("a", "b", "c"), class = "factor"),      x2 = structure(1:3, .Label = c("x", "y", "z"), class = "factor")), .Names = c("x1",  "x2"), row.names = c(NA, -3L), class = "data.frame")

dondapati Over a year ago

fact1 <- c('x1','x2'); for(i in seq_along(1:2)){    print( lapply(lapply(df[fact1],                               function(x) levels(x))[[i]],                        function(x){df[fact1][[i]]== x}*1))    }

MichaelChirico Over a year ago

@user7462639 i recommend asking a new question, making sure to cite this question and include your attempts thus far. as i recommended in my answer, you probably don't want to do this yourself; see also the model.matrix function
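
For the multi-column case raised in these comments, one possible sketch is to repeat the single-column logic inside lapply and bind the pieces together. It rebuilds the commenter's dummy data with data.frame; the "column_level" naming scheme is an assumption made here to keep the new columns distinguishable.

```r
df <- data.frame(
  x1 = factor(c("a", "b", "c")),
  x2 = factor(c("x", "y", "z"))
)
fact1 <- c("x1", "x2")

# For each factor column, build a matrix with one 0/1 column per level
flag_list <- lapply(fact1, function(col) {
  lv <- levels(df[[col]])
  m  <- do.call(cbind, lapply(lv, function(x) as.integer(df[[col]] == x)))
  colnames(m) <- paste(col, lv, sep = "_")
  m
})

# Combine the per-column matrices side by side
flags <- do.call(cbind, flag_list)
colnames(flags)
```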

Cornflake · Accepted Answer · 2025-03-13 18:51:44Z

It is a less sophisticated solution, but still a solution. Just use this function, where the base argument is your data.frame object containing the categorical columns. The one caveat is that if your data frame has both categorical and numerical columns, you will need to subset the categorical columns into a new data frame and apply the function to that subset. Otherwise, the function will "binarize" every column, even the numerical ones.

CategoriesIntoBinaries<-function(base)
{
    # Start empty; cbind(NULL, x) returns x as a one-column matrix
    NewColumns<-NULL
    index_names<-1
    for( j in seq_len(ncol(base)) )
    {
        # Treat column j as categorical and collect its distinct values
        Column_j<-factor(base[,j])
        UniqueCol_j<-unique(Column_j)
        size_UniqueCol_j<-length(UniqueCol_j)
        ColumnName_j<-colnames(base)[j]
        SubColumnsName_j<-character(size_UniqueCol_j)
        for(i in seq_len(size_UniqueCol_j))
        {
            # 1 where the row equals the i-th value, 0 otherwise
            aux_j<-as.numeric(Column_j == UniqueCol_j[i])
            NewColumns<-cbind(NewColumns, aux_j)
            SubColumnsName_j[i]<-paste0(ColumnName_j, "_", UniqueCol_j[i] )
        }
        # Name the columns just added as <column>_<value>
        colnames(NewColumns)[ index_names:(size_UniqueCol_j+index_names -1) ]<-SubColumnsName_j
        index_names<- index_names + size_UniqueCol_j
    }
    return(NewColumns)
}

## Example
NewBase<-CategoriesIntoBinaries(base)
head(NewBase)
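
The subsetting step mentioned above could look like the following sketch, which assumes a small made-up data frame (the columns colour, size and weight are hypothetical). It only separates the categorical columns from the numerical ones; the calls to the function above are shown as comments.

```r
base <- data.frame(
  colour = c("red", "blue", "red"),
  size   = c("S", "M", "S"),
  weight = c(1.5, 2.0, 1.7),
  stringsAsFactors = TRUE
)

# TRUE for columns that hold categories (factor or character)
is_categorical <- vapply(
  base,
  function(col) is.factor(col) || is.character(col),
  logical(1)
)

categorical_part <- base[, is_categorical, drop = FALSE]
numeric_part     <- base[, !is_categorical, drop = FALSE]

# binaries <- CategoriesIntoBinaries(categorical_part)
# result   <- cbind(numeric_part, binaries)
```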
