Making a dataframe with binary data based on conditions in another

Question

I have a large data frame that I want to filter and turn into a binary (0/1) data frame based on several conditions.

This is the original data frame:

a1 <- data.frame(
  ID = c(rep("ID_1", 3), rep("ID_2", 3)),
  gene = c("A", "D", "X", "D", "D", "A"),
  C = c("Q", "R", "S", "S", "R", "Q"),
  D = c(8, 3, 3, 4, 5, 4),
  # E is drawn at random, so its values differ from run to run
  E = sample(c("silent", "non-silent"), 6, replace = TRUE)
)

For example:

    ID gene C D          E
1 ID_1    A Q 8 non-silent
2 ID_1    D R 3     silent
3 ID_1    X S 3     silent
4 ID_2    D S 4 non-silent
5 ID_2    D R 5     silent
6 ID_2    A Q 4 non-silent

I have now created an empty data frame with one row per gene and one column per ID:

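# sort(unique()) works whether the columns are factors or character vectors
genes <- sort(unique(as.character(a1$gene)))
ids <- sort(unique(as.character(a1$ID)))
dt <- as.data.frame(matrix(NA, length(genes), length(ids) + 1))
colnames(dt) <- c("gene", ids)
dt$gene <- genes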

  gene ID_1 ID_2
1    A   NA   NA
2    D   NA   NA
3    X   NA   NA

Now I want to put a 1 where a gene is present for an ID and a 0 where it is not. Later I would also like to add other conditions, for example only putting a 1 when column E is non-silent. Is there a way to do this in base R, or with a package such as data.table or plyr (ddply)?

lukeA · Accepted Answer · 2014-06-25 11:28:12Z

3

You can use dcast from the reshape2 package:

library(reshape2)
dcast(a1, gene ~ ID)
#   gene ID_1 ID_2
# 1    A    1    1
# 2    D    1    2
# 3    X    1    0

or, to get 0/1 indicators instead of counts:

dcast(a1, gene ~ ID, fun.aggregate = function(x) (length(x) > 0L) * 1L)
#   gene ID_1 ID_2
# 1    A    1    1
# 2    D    1    1
# 3    X    1    0

dcast is also available for data.table objects.
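For the data.table case, a minimal sketch might look like the following. It assumes a recent data.table release in which dcast() is exported by data.table itself; a1_dt, wide and id_cols are names introduced here for illustration only.

```r
library(data.table)

# Same ID/gene pairs as in the question
a1_dt <- data.table(
  ID   = c(rep("ID_1", 3), rep("ID_2", 3)),
  gene = c("A", "D", "X", "D", "D", "A")
)

# Count rows per gene/ID pair; pairs that never occur are filled with 0
wide <- dcast(a1_dt, gene ~ ID, value.var = "ID", fun.aggregate = length)

# Turn the counts into 0/1 indicators, modifying wide in place
id_cols <- setdiff(names(wide), "gene")
wide[, (id_cols) := lapply(.SD, function(x) as.integer(x > 0L)), .SDcols = id_cols]
```

Counting first and binarizing afterwards keeps the cast step to a plain length aggregate.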


5 Comments

paul_dg Over a year ago

This is a nice solution and it works perfectly for the example, but I get some strange results with my original data. I get the message "Using freq as value column: use value.var to override." Also, not all IDs are present in the binary table.

lukeA Over a year ago

I don't know about your real data, but to get rid of the warning (?) message just specify value.var = "ID" explicitly.
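As a rough illustration of that suggestion, the sketch below names value.var explicitly on a cut-down copy of the question's data. The variable name binary is made up for this example.

```r
library(reshape2)

a1 <- data.frame(
  ID   = c(rep("ID_1", 3), rep("ID_2", 3)),
  gene = c("A", "D", "X", "D", "D", "A"),
  D    = c(8, 3, 3, 4, 5, 4)
)

# Naming value.var stops dcast from guessing a value column.
# The aggregate only checks whether any rows fall in a cell,
# so any existing column would serve as value.var here.
binary <- dcast(a1, gene ~ ID, value.var = "ID",
                fun.aggregate = function(x) as.integer(length(x) > 0L))
```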

paul_dg Over a year ago

Solved it, thank you! But what if I want to take other columns into account, for example to keep only non-silent in the E column?

lukeA Over a year ago

You could change the formula from gene ~ ID to gene + E ~ ID.
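A sketch of how this could look, together with the alternative of filtering the rows before casting. E is hard-coded to the values printed in the question so the result is reproducible; to_binary, by_effect and non_silent are names made up for this sketch.

```r
library(reshape2)

a1 <- data.frame(
  ID   = c(rep("ID_1", 3), rep("ID_2", 3)),
  gene = c("A", "D", "X", "D", "D", "A"),
  E    = c("non-silent", "silent", "silent", "non-silent", "silent", "non-silent")
)

# 1 if at least one row falls in the cell, otherwise 0
to_binary <- function(x) as.integer(length(x) > 0L)

# Option 1: one output row per gene/E combination
by_effect <- dcast(a1, gene + E ~ ID, value.var = "ID",
                   fun.aggregate = to_binary)

# Option 2: keep only the non-silent rows, then cast as before
non_silent <- dcast(a1[a1$E == "non-silent", ], gene ~ ID, value.var = "ID",
                    fun.aggregate = to_binary)
```

With option 2, gene X drops out of the result because it has no non-silent rows in this data.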

eddi Over a year ago

If OP's data is a data.table, use dcast.data.table instead of dcast to get the best of both worlds.

konvas · Answer · 2014-06-25 11:12:39Z

1

To see if a gene is present for each ID:

dt$ID_1 <- dt$gene %in% a1[a1$ID == "ID_1", ]$gene
dt$ID_2 <- dt$gene %in% a1[a1$ID == "ID_2", ]$gene

So dt$ID_1 & dt$ID_2 gives the genes that are present in both. Note that the columns hold TRUE/FALSE; wrap the right-hand side in as.integer() if you need 1/0.

If you have many IDs, you can iterate over them with e.g. lapply. To apply the same test to other IDs or columns, replace the hard-coded string ("ID_1") with a variable and wrap the code in a function.
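The iteration described above could be sketched in base R as follows. presence_table is a hypothetical helper written for this example, not part of any package, and E is hard-coded to the values printed in the question so the result is reproducible.

```r
a1 <- data.frame(
  ID   = c(rep("ID_1", 3), rep("ID_2", 3)),
  gene = c("A", "D", "X", "D", "D", "A"),
  E    = c("non-silent", "silent", "silent", "non-silent", "silent", "non-silent")
)

# Hypothetical helper: one row per gene, one 0/1 column per ID.
# `keep` is a logical vector choosing which rows of df count as "present".
presence_table <- function(df, keep = rep(TRUE, nrow(df))) {
  genes <- sort(unique(as.character(df$gene)))
  ids <- sort(unique(as.character(df$ID)))
  kept <- df[keep, , drop = FALSE]
  # For each ID, flag the genes that occur among the kept rows
  cols <- lapply(ids, function(id) as.integer(genes %in% kept$gene[kept$ID == id]))
  names(cols) <- ids
  data.frame(gene = genes, cols, stringsAsFactors = FALSE)
}

all_rows <- presence_table(a1)
non_silent <- presence_table(a1, keep = a1$E == "non-silent")
```

Because genes and IDs are taken from the full data before filtering, a gene with no rows left after the filter still appears, with 0 in every ID column.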
