
I am trying to convert a column that has categorical data ('A', 'B', or 'C') to 3 columns where 1,0,0 would be 'A'; 0,1,0 would represent 'B', etc.

I found this code online:

flags = data.frame(Reduce(cbind, 
     lapply(levels(d$purpose), function(x){(d$purpose == x)*1})
))
names(flags) = levels(d$purpose)
d = cbind(d, flags)

# Include the new columns as input variables
levelnames = paste(names(flags), collapse = " + ")
neuralnet(paste("output ~ ", levelnames), d)

Converting categorical variables in R for ANN (neuralnet)

But I'm very new to R. Can anyone break down what this complicated looking code is doing?

edit:

Implementing @nongkrong's recommendations I'm running into a problem:

CSV:

X1,X2,X3
A,D,Q
B,E,R
C,F,S
B,G,T
C,H,U
A,D,Q

R:

newData <- read.csv("new.csv")
newerData <- model.matrix(~ X1 + X2 + X3 -1, data=newData)
newerData

R Output:

  X1A X1B X1C X2E X2F X2G X2H X3R X3S X3T X3U
1   1   0   0   0   0   0   0   0   0   0   0
2   0   1   0   1   0   0   0   1   0   0   0
3   0   0   1   0   1   0   0   0   1   0   0
4   0   1   0   0   0   1   0   0   0   1   0
5   0   0   1   0   0   0   1   0   0   0   1
6   1   0   0   0   0   0   0   0   0   0   0

It works great with 1 column, but is missing X2D and X3Q. Any ideas why?

  • I don't think this code is necessary; you can simply use model.matrix(~ purpose - 1, data = d). All that code does is expand the factor variable into a set of dummy columns. Each dummy column corresponds to one level of the original factor and is 1 wherever that level was present in the original. Commented Jul 30, 2015 at 21:37
  • Awesome, thanks! I got this to work great with 1 column but am getting odd results with multiple columns (see my edit of op) Commented Jul 30, 2015 at 22:08
  • I guess it's because of -1, try removing that and see what you get (though I would have expected X1A to be dropped as well...) Commented Jul 30, 2015 at 22:35
  • It removed an Intercept column. I got around it for now by doing 1 column at a time and using cbind to combine them. Commented Jul 30, 2015 at 22:57
  • The output is like that because these dummy columns are contrasting different combinations of your factors against a base case, the intercept (which was removed from the model by using -1). I'm not sure how to include those columns as dummies as well, sadly. Commented Jul 30, 2015 at 22:57
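A note on the missing X2D and X3Q columns: once the intercept is dropped with -1, model.matrix keeps every level of the first factor but still drops the first level of each later factor as its baseline, so that the columns do not become linearly dependent. One way to get an indicator for every level, sketched here, is to pass full (non-contrast) coding for each factor through the contrasts.arg argument. The data frame below retypes the CSV from the question, and the name fullData is only illustrative; the columns have to be factors for contrasts() to work.

```r
# Same rows as new.csv in the question, typed in directly so the sketch is
# self-contained. stringsAsFactors = TRUE makes every column a factor.
newData <- data.frame(
  X1 = c("A", "B", "C", "B", "C", "A"),
  X2 = c("D", "E", "F", "G", "H", "D"),
  X3 = c("Q", "R", "S", "T", "U", "Q"),
  stringsAsFactors = TRUE
)

# contrasts(f, contrasts = FALSE) returns an identity matrix with one
# column per level, so no level is treated as a baseline and dropped.
fullData <- model.matrix(
  ~ X1 + X2 + X3 - 1,
  data = newData,
  contrasts.arg = lapply(newData, contrasts, contrasts = FALSE)
)

colnames(fullData)
```

Each row then has exactly one 1 per original column. That makes the columns linearly dependent, which matters for regression-style models, but it is the one-indicator-per-level layout the question asks for.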

2 Answers

@nongkrong is right--read ?formula and you'll see that most functions that accept a formula as input (e.g. lm, glm, etc.) will automatically convert categorical variables (stored as factors or characters) to dummies; you can force this on numeric variables by wrapping them in as.factor(var) in your formula.
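As a small illustration of that automatic conversion (the data frame toy and its columns are made up for this sketch), fitting lm on a character column yields one coefficient per non-baseline level, and as.factor() does the same for a numeric code:

```r
toy <- data.frame(
  y = c(1, 2, 3, 4, 5, 6),
  g = c("A", "B", "C", "A", "B", "C"),  # categorical, stored as character
  k = c(1, 2, 3, 1, 2, 3)               # numeric codes for three groups
)

# The character column is expanded into dummies automatically;
# "A" becomes the baseline absorbed by the intercept.
fit_chr <- lm(y ~ g, data = toy)
names(coef(fit_chr))   # "(Intercept)" "gB" "gC"

# Without as.factor(), k would be treated as a single numeric predictor.
fit_num <- lm(y ~ as.factor(k), data = toy)
names(coef(fit_num))   # "(Intercept)" "as.factor(k)2" "as.factor(k)3"
```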

That said, I've encountered situations where it's convenient to have created these indicators by hand anyway--e.g., a data set with an ethnicity variable where <1% of the data fit in one or several of the ethnicity codes. There are other ways to deal with this (simply delete the minority-minority observations, e.g.), but I find that varies by situation.

So, I've annotated the code for you:

flags = data.frame(Reduce(cbind, 
     lapply(levels(d$purpose), function(x){(d$purpose == x)*1})
))

Lots going on in this first line, so let's go bit-by-bit:

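d$purpose==x checks each entry of d$purpose for equality to x; the result will be TRUE or FALSE (or NA if there are missing values). Multiplying by 1 (*1) coerces the logical result to a number (so TRUE becomes 1 and FALSE becomes 0). Note that the result is a numeric (double) vector rather than an integer one; use as.integer() if you need integers.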
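A minimal sketch of this step on its own, using a made-up purpose vector that includes one missing value:

```r
purpose <- factor(c("car", "house", "car", NA))

# Element-wise comparison against a single level
is_car <- purpose == "car"
is_car        # TRUE FALSE TRUE NA

# Multiplying by 1 turns the logical vector into a numeric one
is_car * 1    # 1 0 1 NA
```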

lapply applies the function in its second argument to each element of its first argument--so for each element of levels(d$purpose) (i.e., each level of d$purpose), we output a vector of 0s and 1s, where the 1s correspond to the elements of d$purpose matching the given level. The output of lapply is a list (hence l in front of apply), with one list element corresponding to each of the levels of d$purpose.
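To see the shape of that list, here is a small sketch with a made-up purpose factor (the name flag_list is only for illustration):

```r
purpose <- factor(c("car", "house", "car", "travel"))

# One 0/1 vector per level; levels() returns "car", "house", "travel"
flag_list <- lapply(levels(purpose), function(x) (purpose == x) * 1)

length(flag_list)   # 3, one element per level
flag_list[[1]]      # 1 0 1 0, the indicator for "car"
```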

We want to get this into our data.frame, so a list isn't very useful; Reduce is what we use to back out the information from the list to a data.frame form. Reduce(cbind,LIST) is the same as cbind(LIST[[1]],LIST[[2]],LIST[[3]],...)--convenient shorthand, especially when we don't know the length of LIST.

Wrapping this in data.frame converts the resulting matrix into a data.frame.
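A short sketch of the Reduce step, using a made-up list of three indicator vectors. do.call(cbind, LIST) is shown alongside as another common way to bind a list of unknown length; the variable names are illustrative only.

```r
flag_list <- list(c(1, 0, 1), c(0, 1, 0), c(0, 0, 0))

# Reduce binds the vectors pairwise: cbind(cbind(v1, v2), v3)
via_reduce <- Reduce(cbind, flag_list)

# do.call passes all list elements to a single cbind call
via_docall <- do.call(cbind, flag_list)

# Both give a 3 x 3 matrix holding the same values
same_values <- all(via_reduce == via_docall)

# data.frame() then turns the matrix into a data.frame
flags <- data.frame(via_reduce)
dim(flags)
```

The column names at this point are not meaningful, which is why the names are assigned explicitly in the next line of the original code.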

# This line simply puts column names on each of the indicator variables.
#   Note that you can replace the RHS of this line with whatever
#   naming convention you want for the levels--a common approach might
#   be to specify paste0(levels(d$purpose), "_flag"), e.g.
names(flags) = levels(d$purpose)
# This line adds all the indicator variables to the original
#   data.frame
d = cbind(d, flags)
# This creates a string of the form "level1 + level2 + ... + leveln"
levelnames = paste(names(flags), collapse = " + ")
# Finally, we create a formula of the form output ~ d1 + d2 + ... + dn,
#   where each of the d* is a dummy for a level of the categorical variable
neuralnet(paste("output ~ ", levelnames), d)
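The string-building part can be tried without neuralnet at all. The sketch below uses made-up level names and the "_flag" naming convention mentioned in the comments, then converts the string with as.formula(). That conversion is a cautious choice made for this sketch, not something the original code does; the sketch does not rely on whether a given version of neuralnet accepts a bare string.

```r
# Hypothetical level names standing in for levels(d$purpose)
level_flags <- c("car", "house", "travel")
flag_names <- paste0(level_flags, "_flag")

# "car_flag + house_flag + travel_flag"
rhs <- paste(flag_names, collapse = " + ")

# Turn the string into a formula object
model_formula <- as.formula(paste("output ~", rhs))
all.vars(model_formula)
```

Names built this way need to be syntactically valid R names (no spaces, for example), otherwise the formula will not parse.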

Also note that something like this could have been done much more simply with the data.table package:

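library(data.table)
library(neuralnet)
setDT(d)
# the levels must be read from the column of d
l = levels(d$purpose)
# add one integer indicator column per level, by reference
d[ , (l) := lapply(l, function(x) as.integer(purpose == x))]
# build the formula from the level names and fit on d
neuralnet(as.formula(paste0("output ~ ", paste(l, collapse = " + "))), data = d)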

4 Comments

@Michael. How would I write this if I have multiple columns, each with multiple factor levels? Can I repeat the code for each column?
I wrote the following code, but it's not working. Dummy data: df <- structure(list(x1 = structure(1:3, .Label = c("a", "b", "c"), class = "factor"), x2 = structure(1:3, .Label = c("x", "y", "z"), class = "factor")), .Names = c("x1", "x2"), row.names = c(NA, -3L), class = "data.frame")
fact1 <- c('x1','x2'); for(i in seq_along(1:2)){ print( lapply(lapply(df[fact1], function(x) levels(x))[[i]], function(x){df[fact1][[i]]== x}*1)) }
@user7462639 I recommend asking a new question, making sure to cite this question and include your attempts thus far. As I recommended in my answer, you probably don't want to do this yourself; see also the model.matrix function.

It is a less sophisticated solution, but still a solution. Just use this function, where the base argument is your data.frame containing the categorical columns. One caveat: if your data frame mixes categorical and numerical columns, first create a subset holding only the categorical columns and apply the function to that subset. Otherwise, the function will "binarize" every column, even the numerical ones.
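One possible way to build that subset, sketched on a made-up data frame (the column names and the helper variables is_cat, categorical, and numeric_part are illustrative assumptions):

```r
base <- data.frame(
  colour = c("red", "blue", "red"),
  size   = c(1.5, 2.0, 3.1),
  shape  = factor(c("sq", "ci", "sq"))
)

# Treat factor and character columns as categorical
is_cat <- sapply(base, function(col) is.factor(col) || is.character(col))

# drop = FALSE keeps a data.frame even when only one column is selected
categorical  <- base[, is_cat, drop = FALSE]
numeric_part <- base[, !is_cat, drop = FALSE]

names(categorical)    # "colour" "shape"
names(numeric_part)   # "size"
```

The categorical subset can then be passed to the function, and the result combined with numeric_part using cbind.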

CategoriesIntoBinaries <- function(base)
{
    # Accumulates the indicator columns for every column of `base`
    NewColumns <- NULL
    index_names <- 1
    for (j in seq_len(ncol(base)))
    {
        Column_j <- factor(base[, j])
        UniqueCol_j <- unique(Column_j)
        size_UniqueCol_j <- length(UniqueCol_j)
        ColumnName_j <- colnames(base)[j]
        SubColumnsName_j <- character(size_UniqueCol_j)
        for (i in seq_len(size_UniqueCol_j))
        {
            # 1 where the row equals the i-th distinct value, 0 otherwise
            aux_j <- as.numeric(Column_j == UniqueCol_j[i])
            NewColumns <- cbind(NewColumns, aux_j)
            SubColumnsName_j[i] <- paste0(ColumnName_j, "_", UniqueCol_j[i])
        }
        # Name the columns just added as <column>_<value>
        colnames(NewColumns)[index_names:(size_UniqueCol_j + index_names - 1)] <- SubColumnsName_j
        index_names <- index_names + size_UniqueCol_j
    }
    return(NewColumns)
}

## Example
NewBase<-CategoriesIntoBinaries(base)
head(NewBase)
