2

I'm reading in data from a csv file where each row contains some number of individual strings:

e.g.

data.csv ->
    x,f,t,h,b,g
    d,g,h
    g,h,a,s,d
    f
    q,w,e,r,t,y,u,i,o

data <- read.csv("data.csv", header = FALSE)

I want to transform this input into a data frame where the columns are the set of unique strings present in the input. In this case, the columns would be the set of strings {x,f,t,h,b,g,d,a,s,q,w,e,r,y,u,i,o}. Additionally, the new data frame should contain a row for each row in the input data frame such that a column will have the value 1 if the column's name was present in that row in the input data frame, or 0 if the column's name was not present in that input row.

In this example, the desired output would be the following:

    x f t h b g d a s q w e r y u i o
   ----------------------------------
1 | 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
2 | 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0
3 | 0 0 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0
4 | 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 | 0 0 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1

The code below is what I currently have. However, the output df ends up being a data frame with what appear to be the correct columns, but 0 rows.

I'm very inexperienced at R, and this was my attempt at putting together something that works. It seems to work as expected up until the call to apply(), which unexpectedly doesn't add anything to df.

data <- read.csv("data.csv", header = FALSE)

columnNames = c()
for (row in data) {
  for (eventName in row) {
    if (!(eventName %in% columnNames)) {
      columnNames = c(columnNames, eventName)
    }
  }
}
columnNames = t(columnNames)

df = data.frame(columnNames)
colnames(df) = columnNames
df = df[-1,]

apply(data, 1, function(row, df) {
  dat = data.frame(columnNames)
  colnames(dat) = columnNames
  dat = dat[-1,]

  for (eventName in row) {
    if (eventName != "") {
      dat[1,eventName] = 1
    }
  }

  df = rbind(df, dat)
}, df)

After the script finishes it tells me there were many warnings of the following two forms:

9: In `[<-.factor`(`*tmp*`, iseq, value = 1) : invalid factor level, NA generated
10: In `[<-.factor`(`*tmp*`, iseq, value = 1) :
  invalid factor level, NA generated
3
  • Please use dput to show the dataset. It is confusing from your input. Commented Jul 30, 2016 at 3:43
  • I'm just showing that the input is a standard, comma-separated csv file which gets converted by read.csv Commented Jul 30, 2016 at 3:47
  • Okay, anyway I updated the post if all of these are single character strings. Commented Jul 30, 2016 at 3:49

1 Answer 1

3

We can use mtabulate after splitting the column by ,

library(qdapTools)
mtabulate(strsplit(as.character(df1[,1]), ","))

Or with base R methods by splitting the column by ,, set the names of the list output as the sequence of rows, convert the list to data.frame (stack), change the 'values' column to factor with levels specified and then get the frequency with table.

table(transform(stack(setNames(strsplit(as.character(df1[,1]), ","), 1:nrow(df1)))[2:1], 
           values = factor(values, levels = unique(values))))
# 
#    x f t h b g d a s q w e r y u i o
#  1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
#  2 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0
#  3 0 0 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0
#  4 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#  5 0 0 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1

Update

If this is not a single column,

mtabulate(apply(df2, 1, FUN = function(x) x[x!=""]))

Or

as.data.frame.matrix(table(transform(stack(setNames(apply(df2, 1, 
     FUN = function(x) x[x!=""]), 
           1:nrow(df2)))[2:1],  values = factor(values, levels = unique(values)))))
#    
#    x f t h b g d a s q w e r y u i o
#  1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
#  2 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0
#  3 0 0 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0
#  4 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#  5 0 0 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1

data

df1 <- structure(list(V1 = c("x,f,t,h,b,g", "d,g,h", "g,h,a,s,d", "f", 
"q,w,e,r,t,y,u,i,o")), .Names = "V1", class = "data.frame", 
row.names = c(NA, -5L))

df2 <- structure(list(v1 = c("x", "d", "g", "f", "q"), v2 = c("f", "g", 
"h", "", "w"), v3 = c("t", "h", "a", "", "e"), v4 = c("h", "", 
"s", "", "r"), v5 = c("b", "", "d", "", "t"), v6 = c("g", "", 
"", "", "y"), v7 = c("", "", "", "", "u"), v8 = c("", "", "", 
"", "i"), v9 = c("", "", "", "", "o")), .Names = c("v1", "v2", 
"v3", "v4", "v5", "v6", "v7", "v8", "v9"), row.names = c(NA, 
-5L), class = "data.frame")
Sign up to request clarification or add additional context in comments.

7 Comments

@fvgs Here I am assuming that you have a single column that have these strings.
Where are you assuming a single column? The data being read in is from a csv of the form described above. Neither the input nor the output are single columns. I'm trying out your example, but the output (using View) is very different from the desired data frame
@fvgs Okay, if it is not a single column, you must be having blanks in other columns, right as the number of elements in each row is different and I haven't seen you using fill=TRUE in read.csv`
I tried to address the blank issue by checking if eventName != "" in the last for loop. But using fill=TRUE seems like a much cleaner solution
Yes indeed, thanks for all your help! Now I just need to understand exactly what each transformation is doing...
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.