Convert data frame of values into binary data frame where each unique value is a column

Question

I'm reading in data from a csv file where each row contains some number of individual strings:

e.g.

data.csv ->
    x,f,t,h,b,g
    d,g,h
    g,h,a,s,d
    f
    q,w,e,r,t,y,u,i,o

data <- read.csv("data.csv", header = FALSE)

I want to transform this input into a data frame where the columns are the set of unique strings present in the input. In this case, the columns would be the set of strings {x,f,t,h,b,g,d,a,s,q,w,e,r,y,u,i,o}. Additionally, the new data frame should contain a row for each row in the input data frame such that a column will have the value 1 if the column's name was present in that row in the input data frame, or 0 if the column's name was not present in that input row.

In this example, the desired output would be the following:

    x f t h b g d a s q w e r y u i o
   ----------------------------------
1 | 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
2 | 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0
3 | 0 0 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0
4 | 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 | 0 0 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1

The code below is what I currently have. However, the output df ends up being a data frame with what appear to be the correct columns, but 0 rows.

I'm very inexperienced at R, and this was my attempt at putting together something that works. It seems to work as expected up until the call to apply(), which unexpectedly doesn't add anything to df.

data <- read.csv("data.csv", header = FALSE)

columnNames = c()
for (row in data) {
  for (eventName in row) {
    if (!(eventName %in% columnNames)) {
      columnNames = c(columnNames, eventName)
    }
  }
}
columnNames = t(columnNames)

df = data.frame(columnNames)
colnames(df) = columnNames
df = df[-1,]

apply(data, 1, function(row, df) {
  dat = data.frame(columnNames)
  colnames(dat) = columnNames
  dat = dat[-1,]

  for (eventName in row) {
    if (eventName != "") {
      dat[1,eventName] = 1
    }
  }

  df = rbind(df, dat)
}, df)

After the script finishes it tells me there were many warnings of the following two forms:

9: In `[<-.factor`(`*tmp*`, iseq, value = 1) : invalid factor level, NA generated
10: In `[<-.factor`(`*tmp*`, iseq, value = 1) :
  invalid factor level, NA generated

Please use dput to show the dataset. It is confusing from your input.
– akrun, Commented Jul 30, 2016 at 3:43
I'm just showing that the input is a standard, comma-separated csv file which gets converted by read.csv
– fvgs, Commented Jul 30, 2016 at 3:47
Okay, anyway I updated the post if all of these are single character strings.
– akrun, Commented Jul 30, 2016 at 3:49

akrun · Accepted Answer · 2016-07-30 04:04:57Z

3

We can use mtabulate from the qdapTools package after splitting the column on commas:

library(qdapTools)
mtabulate(strsplit(as.character(df1[,1]), ","))

Or with base R methods: split the column on commas, set the names of the resulting list to the row numbers, convert the list to a long data.frame with stack, change the 'values' column to a factor with the levels given in order of first appearance, and then get the frequencies with table.

table(transform(stack(setNames(strsplit(as.character(df1[,1]), ","), 1:nrow(df1)))[2:1], 
           values = factor(values, levels = unique(values))))
# 
#    x f t h b g d a s q w e r y u i o
#  1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
#  2 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0
#  3 0 0 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0
#  4 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#  5 0 0 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1
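
To make each transformation in the one-liner easier to follow, here is the same pipeline written out step by step. This is only a sketch of the approach above; the intermediate names parts, long, and res are illustrative and not part of the original code.

```r
# Same data as df1 in this answer
df1 <- data.frame(V1 = c("x,f,t,h,b,g", "d,g,h", "g,h,a,s,d", "f",
                         "q,w,e,r,t,y,u,i,o"),
                  stringsAsFactors = FALSE)

# 1. Split each row's string on commas: a list with one character vector per row
parts <- strsplit(as.character(df1[, 1]), ",")

# 2. Name the list elements by row number so stack() can record the source row
names(parts) <- seq_len(nrow(df1))

# 3. stack() gives a long data frame with columns 'values' and 'ind';
#    [2:1] reorders them to (ind, values) so the input rows become table rows
long <- stack(parts)[2:1]

# 4. Fix the level order to order of first appearance;
#    without this, table() would sort the columns alphabetically
long$values <- factor(long$values, levels = unique(long$values))

# 5. Cross-tabulate row id against value
res <- table(long)
res
```

Wrapping the result in as.data.frame.matrix(res) turns the table into a regular data frame, as done in the multi-column version of this answer.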

Update

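If the data is not a single column but spread over several columns, with blanks padding the shorter rows, drop the blanks from each row first: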

mtabulate(apply(df2, 1, FUN = function(x) x[x!=""]))

Or

as.data.frame.matrix(table(transform(stack(setNames(apply(df2, 1, 
     FUN = function(x) x[x!=""]), 
           1:nrow(df2)))[2:1],  values = factor(values, levels = unique(values)))))
#    
#    x f t h b g d a s q w e r y u i o
#  1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
#  2 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0
#  3 0 0 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0
#  4 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#  5 0 0 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1
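
A possible alternative for the multi-column case, assuming only presence (0 or 1) matters rather than counts: test each row against the vector of unique values with %in%. This is a sketch, not the method used above; the names m, vals, and binary are illustrative.

```r
# Same data as df2 in this answer
df2 <- data.frame(
  v1 = c("x", "d", "g", "f", "q"), v2 = c("f", "g", "h", "", "w"),
  v3 = c("t", "h", "a", "", "e"),  v4 = c("h", "", "s", "", "r"),
  v5 = c("b", "", "d", "", "t"),   v6 = c("g", "", "", "", "y"),
  v7 = c("", "", "", "", "u"),     v8 = c("", "", "", "", "i"),
  v9 = c("", "", "", "", "o"),
  stringsAsFactors = FALSE
)

# Character matrix with one input row per matrix row
m <- as.matrix(df2)

# Unique non-blank values, in row-by-row order of first appearance
vals <- unique(as.vector(t(m)))
vals <- vals[vals != ""]

# For each row, mark which of the unique values occur in it;
# apply() returns one column per input row, so transpose the result
binary <- t(apply(m, 1, function(row) as.integer(vals %in% row)))
colnames(binary) <- vals
binary <- as.data.frame(binary)
binary
```

Unlike table, this never produces a value above 1, even if a string is repeated within a row.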

data

df1 <- structure(list(V1 = c("x,f,t,h,b,g", "d,g,h", "g,h,a,s,d", "f", 
"q,w,e,r,t,y,u,i,o")), .Names = "V1", class = "data.frame", 
row.names = c(NA, -5L))

df2 <- structure(list(v1 = c("x", "d", "g", "f", "q"), v2 = c("f", "g", 
"h", "", "w"), v3 = c("t", "h", "a", "", "e"), v4 = c("h", "", 
"s", "", "r"), v5 = c("b", "", "d", "", "t"), v6 = c("g", "", 
"", "", "y"), v7 = c("", "", "", "", "u"), v8 = c("", "", "", 
"", "i"), v9 = c("", "", "", "", "o")), .Names = c("v1", "v2", 
"v3", "v4", "v5", "v6", "v7", "v8", "v9"), row.names = c(NA, 
-5L), class = "data.frame")
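
For reference, a sketch of how input like the question's file would end up in the df2 shape. The csv_text string is a stand-in for data.csv and is an assumption for illustration. read.csv already uses fill = TRUE by default; it is spelled out here only for clarity. colClasses = "character" keeps every column as plain strings, which avoids factor columns in R versions before 4.0, where strings were converted to factors by default.

```r
# Stand-in for the contents of data.csv (hypothetical inline copy)
csv_text <- "x,f,t,h,b,g
d,g,h
g,h,a,s,d
f
q,w,e,r,t,y,u,i,o"

# Shorter rows are padded with "" because fill = TRUE;
# all columns are read as character, so no factors are created
data <- read.csv(text = csv_text, header = FALSE, fill = TRUE,
                 colClasses = "character")

str(data)
```

One caveat: as documented for read.table, the number of columns is determined from the first five lines of input, so a longer row further down in a larger file may need col.names set explicitly.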

edited Jul 30, 2016 at 4:04

answered Jul 30, 2016 at 3:26

akrun


7 Comments

akrun Over a year ago

@fvgs Here I am assuming that you have a single column that has these strings.

fvgs Over a year ago

Where are you assuming a single column? The data being read in is from a csv of the form described above. Neither the input nor the output are single columns. I'm trying out your example, but the output (using View) is very different from the desired data frame

akrun Over a year ago

@fvgs Okay, if it is not a single column, you must have blanks in the other columns, right? The number of elements in each row is different, and I haven't seen you use fill=TRUE in read.csv.

fvgs Over a year ago

I tried to address the blank issue by checking if eventName != "" in the last for loop. But using fill=TRUE seems like a much cleaner solution

fvgs Over a year ago

Yes indeed, thanks for all your help! Now I just need to understand exactly what each transformation is doing...
