I'm working on populating a binary matrix based on values from a different table. I can create the matrix but am struggling with the looping needed to populate it. I think this is a pretty simple issue so I hope I can get some easy help.

Here's an example of my data:

start <- c(291, 291, 291, 702, 630, 768)
sequence <- c("chr9:103869456:103870456", "chr5:30823103:30824103", "chr11:49801703:49802703", "chr4:133865601:133866601", "chr12:55738034:55739034", "chr8:96569493:96570493")
motif <- c("ARI5B", "ARI5B", "ARI5B", "ATOH1", "EGR1", "EGR1")

df <- data.frame(start, sequence, motif)

I have created a character vector of the unique motif+start combinations like so:

x <- sprintf("%s_%d", df$motif, df$start)
x <- unique(x)

Next I create a binary matrix with the sequences as rows and the values from x as columns:

binmat <- matrix(0, nrow = length(df$sequence), ncol = length(x))
rownames(binmat) <- df$sequence
colnames(binmat) <- x

And now I'm stuck. I want to iterate through columns and rows and put a 1 in each position that has a match. For example, the first sequence is "chr9:103869456:103870456" and it has motif "ARI5B" at starting position 291, so it should get a 1 while the rest of the values in that row remain at 0. The output of this example should look like this:


                         ARI5B_291 ATOH1_702 EGR1_630 EGR1_768
chr9:103869456:103870456         1         0        0        0
chr5:30823103:30824103           1         0        0        0
chr11:49801703:49802703          1         0        0        0
chr4:133865601:133866601         0         1        0        0
chr12:55738034:55739034          0         0        1        0
chr8:96569493:96570493           0         0        0        1

But so far I am unsuccessful. I think I need a double for loop somewhere along these lines:

for (row in binmat){
  for (col in binmat){
     if (row && col %in% x){
         1
     } else { 0
     }
   }
}

But all I get are 0s.

Thanks in advance!

1 Answer

Aren't you just looking for table here? You can get the result as a vectorized one-liner, without loops, by doing:

table(factor(df$sequence, df$sequence), sprintf("%s_%d", df$motif, df$start))
                          
                           ARI5B_291 ATOH1_702 EGR1_630 EGR1_768
  chr9:103869456:103870456         1         0        0        0
  chr5:30823103:30824103           1         0        0        0
  chr11:49801703:49802703          1         0        0        0
  chr4:133865601:133866601         0         1        0        0
  chr12:55738034:55739034          0         0        1        0
  chr8:96569493:96570493           0         0        0        1
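
If you would rather fill a pre-built matrix as in the question, here is a minimal sketch. The loop in the question returns only zeros for two reasons: for (row in binmat) iterates over the cell values of the matrix (all 0), not over its row names, and the bare 1 inside the if is never assigned to any cell. The names key and loopmat are introduced here for illustration; everything else comes from the question.

```r
# Data from the question (stringsAsFactors = FALSE keeps the columns as character).
df <- data.frame(
  start = c(291, 291, 291, 702, 630, 768),
  sequence = c("chr9:103869456:103870456", "chr5:30823103:30824103",
               "chr11:49801703:49802703", "chr4:133865601:133866601",
               "chr12:55738034:55739034", "chr8:96569493:96570493"),
  motif = c("ARI5B", "ARI5B", "ARI5B", "ATOH1", "EGR1", "EGR1"),
  stringsAsFactors = FALSE
)

# One motif_start label per row of df (not de-duplicated).
key <- sprintf("%s_%d", df$motif, df$start)

# Zero matrix with one row per unique sequence and one column per unique label.
binmat <- matrix(0,
                 nrow = length(unique(df$sequence)),
                 ncol = length(unique(key)),
                 dimnames = list(unique(df$sequence), unique(key)))
loopmat <- binmat

# Vectorized fill: a two-column character matrix indexes cells by (row name, column name).
binmat[cbind(df$sequence, key)] <- 1

# Equivalent explicit loop: one pass over the rows of df, assigning into the matrix.
for (i in seq_len(nrow(df))) {
  loopmat[df$sequence[i], key[i]] <- 1
}
```

Both versions loop over the rows of df rather than over every row/column pair of the matrix, so no nested loop is needed.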

4 Comments

This works perfectly on my sample code, but when I run it with my full data set I get the error: Error in `levels<-`(`*tmp*`, value = as.character(levels)) : factor level [15] is duplicated
Try factor(df$sequence, unique(df$sequence)) @LauraMorgan
New error... Error in factor(df$sequence, unique(df$sequence), sprintf("%s_%d", : invalid 'labels'; length 6461 should be 1 or 104
I think it is because I have an additional metadata column that I didn't include in my example. If I delete that column, this works!
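
For data where a sequence appears more than once, the suggestion from the comments can be sketched as follows. The data frame df2 and the names rows, cols, counts and binmat2 are hypothetical, made up here to reproduce the duplicate case. Two points are worth noting. First, table returns counts, so a sequence/motif pair that occurs twice gives 2 and has to be thresholded to get a strictly binary matrix. Second, the "invalid 'labels'" message shows the sprintf call being passed as the third argument of factor, which is labels; this may mean the closing parenthesis after unique(df$sequence) was missing, although that is only a guess from the error text.

```r
# Hypothetical data: one sequence appears three times, once as an exact duplicate row.
df2 <- data.frame(
  start = c(291, 291, 702, 630),
  sequence = c("chr9:103869456:103870456", "chr9:103869456:103870456",
               "chr4:133865601:133866601", "chr9:103869456:103870456"),
  motif = c("ARI5B", "ARI5B", "ATOH1", "EGR1"),
  stringsAsFactors = FALSE
)

# Levels must be unique; unique() keeps the sequences in order of first appearance.
rows <- factor(df2$sequence, levels = unique(df2$sequence))
cols <- sprintf("%s_%d", df2$motif, df2$start)

# factor(...) and sprintf(...) are two separate arguments to table().
counts <- table(rows, cols)

# Threshold the counts to 0/1 and drop the table class to get a plain integer matrix.
binmat2 <- 1L * (unclass(counts) > 0)
```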
