I'm reading in data from a csv file where each row contains some number of individual strings:
e.g.
data.csv ->
x,f,t,h,b,g
d,g,h
g,h,a,s,d
f
q,w,e,r,t,y,u,i,o
data <- read.csv("data.csv", header = FALSE)
I want to transform this input into a data frame where the columns are the set of unique strings present in the input. In this case, the columns would be the set of strings {x,f,t,h,b,g,d,a,s,q,w,e,r,y,u,i,o}. Additionally, the new data frame should contain a row for each row in the input data frame such that a column will have the value 1 if the column's name was present in that row in the input data frame, or 0 if the column's name was not present in that input row.
In this example, the desired output would be the following:
x f t h b g d a s q w e r y u i o
----------------------------------
1 | 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
2 | 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0
3 | 0 0 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0
4 | 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 | 0 0 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1
The code below is what I currently have. However, the output df ends up being a data frame with what appear to be the correct columns, but 0 rows.
I'm very inexperienced at R, and this was my attempt at putting together something that works. It seems to work as expected up until the call to apply(), which unexpectedly doesn't add anything to df.
data <- read.csv("data.csv", header = FALSE)
columnNames = c()
for (row in data) {
for (eventName in row) {
if (!(eventName %in% columnNames)) {
columnNames = c(columnNames, eventName)
}
}
}
columnNames = t(columnNames)
df = data.frame(columnNames)
colnames(df) = columnNames
df = df[-1,]
apply(data, 1, function(row, df) {
dat = data.frame(columnNames)
colnames(dat) = columnNames
dat = dat[-1,]
for (eventName in row) {
if (eventName != "") {
dat[1,eventName] = 1
}
}
df = rbind(df, dat)
}, df)
After the script finishes it tells me there were many warnings of the following two forms:
9: In `[<-.factor`(`*tmp*`, iseq, value = 1) : invalid factor level, NA generated
10: In `[<-.factor`(`*tmp*`, iseq, value = 1) :
invalid factor level, NA generated
dputto show the dataset. It is confusing from your input.read.csv