I have a large data frame that I want to filter and convert into a binary (presence/absence) data frame based on several conditions.

This is the original data frame:

a1 <- data.frame(
  ID = c(rep("ID_1", 3), rep("ID_2", 3)),
  gene = c("A", "D", "X", "D", "D", "A"),
  C = c("Q", "R", "S", "S", "R", "Q"),
  D = c(8, 3, 3, 4, 5, 4),
  # E is drawn at random, so the values differ between runs
  E = sample(c("silent", "non-silent"), 6, replace = TRUE)
)

For example:

    ID  gene    C   D   E
1   ID_1    A   Q   8   non-silent
2   ID_1    D   R   3   silent
3   ID_1    X   S   3   silent
4   ID_2    D   S   4   non-silent
5   ID_2    D   R   5   silent
6   ID_2    A   Q   4   non-silent

I have now created an empty data frame with the IDs as columns and the genes as rows:

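# unique() works whether the columns are character (default in R >= 4.0) or factor
genes <- sort(unique(as.character(a1$gene)))
ids <- sort(unique(as.character(a1$ID)))
dt <- as.data.frame(matrix(NA, length(genes), length(ids) + 1))
colnames(dt) <- c("gene", ids)
dt$gene <- genes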

    gene    ID_1    ID_2
1   A   NA  NA
2   D   NA  NA
3   X   NA  NA

Now I want to put a 1 for each gene that is present for a given ID and a 0 for each gene that is not. Later I would also like to add other conditions, for example only putting a 1 where column E is non-silent. Is there a way to do this in base R, or with a package such as data.table or plyr (ddply)?

2 Answers

You can use dcast from the reshape2 package:

library(reshape2)
dcast(a1, gene ~ ID)
#   gene ID_1 ID_2
# 1    A    1    1
# 2    D    1    2
# 3    X    1    0

or

dcast(a1, gene ~ ID, fun.aggregate = function(x) (length(x) > 0L) * 1L)
#   gene ID_1 ID_2
# 1    A    1    1
# 2    D    1    1
# 3    X    1    0

dcast is also available for data.table objects.


5 Comments

This is a nice solution, and it works perfectly for the example. But I get some strange results with my original data. I get the message "Using freq as value column: use value.var to override.", and not all IDs are present in the binary table.
I don't know about your real data, but to get rid of the warning (?) message just specify value.var = "ID" explicitly.
Solved it, thank you! But what if I want to take other columns into account, for example to count only the non-silent entries in the E column?
You could change the formula from gene ~ ID to gene + E ~ ID.
If OP's data is a data.table, use data.table's dcast method (dcast.data.table) instead of reshape2's dcast to get the best of both worlds.
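As a rough base R illustration of the "only non-silent" condition raised in the comments above, one option is to filter the rows first and then tabulate gene against ID. This is a sketch under assumptions, not the answerer's method: the E values are hard-coded from the printed example in the question (the original uses sample(), so real values vary), and the names keep, counts and binary are made up for illustration.

```r
# Example data; E is fixed here to match the table printed in the question
a1 <- data.frame(
  ID   = c(rep("ID_1", 3), rep("ID_2", 3)),
  gene = c("A", "D", "X", "D", "D", "A"),
  E    = c("non-silent", "silent", "silent", "non-silent", "silent", "non-silent")
)

# Convert to factors before filtering so that genes or IDs with no
# remaining rows still show up (as 0) in the table
a1$gene <- factor(a1$gene)
a1$ID   <- factor(a1$ID)

# Keep only the rows that satisfy the extra condition
keep <- a1[a1$E == "non-silent", ]

# Count gene/ID combinations, then collapse the counts to 0/1
counts <- table(keep$gene, keep$ID)
binary <- (counts > 0) * 1L

# Optional: turn the table into a data frame with a gene column
result <- data.frame(gene = rownames(binary),
                     as.data.frame.matrix(binary),
                     row.names = NULL)
print(result)
```

Further conditions can be combined in the row filter, for example a1$E == "non-silent" & a1$D > 3.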
1

To see if a gene is present for each ID:

dt$ID_1 <- dt$gene %in% a1[a1$ID == "ID_1", ]$gene
dt$ID_2 <- dt$gene %in% a1[a1$ID == "ID_2", ]$gene

These columns are logical (TRUE/FALSE), so dt$ID_1 & dt$ID_2 gives you the genes that are present in both IDs.

If you have many IDs and want to iterate over them, you can use e.g. lapply. To apply the same check to other columns, replace the hard-coded ID string (such as "ID_1") with a variable and wrap the expression in a function.
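The lapply idea could look roughly like the following. This is only a sketch: it rebuilds a reduced copy of the example data, the variable names genes, ids and present are made up for illustration, and it converts the logical result to 1/0 with as.integer to match the format asked for in the question.

```r
# Reduced copy of the example data from the question
a1 <- data.frame(
  ID   = c(rep("ID_1", 3), rep("ID_2", 3)),
  gene = c("A", "D", "X", "D", "D", "A")
)

# as.character() makes this work for both character and factor columns
genes <- sort(unique(as.character(a1$gene)))
ids   <- sort(unique(as.character(a1$ID)))

# For each ID, check which genes occur in the rows belonging to that ID
present <- lapply(ids, function(id) {
  as.integer(genes %in% a1$gene[a1$ID == id])
})
names(present) <- ids

# A named list is expanded into one column per ID
dt <- data.frame(gene = genes, present)
print(dt)
```

An extra condition can be added inside the function, for instance by restricting the rows with a1$ID == id & a1$E == "non-silent" when the E column is available.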
