
I have a list of vectors stored in a data frame:

library(seqinr)
mydata <- read.fasta(file="mydata.fasta")
mydatavec <- mydata[[1]] 

lst <- split(mydatavec, as.integer(gl(length(mydatavec), 100,length(mydatavec))))

df <- data.frame(matrix(unlist(lst), nrow=2057, byrow=T), stringsAsFactors=FALSE)

Now each vector (row) in df is 100 long and made up of the letters "a", "c", "g", "t". I would like to calculate the Shannon entropy of each of these vectors. Here is an example of what I mean:

v1 <- count(df[1,], 1) 
a  c  g  t 
27 26 24 23     

v2 <- v1/sum(v1) 
  a    c    g    t 
0.27 0.26 0.24 0.23 

v3 <- -sum(log(v2)*v2) ; print(v3) 
[1] 1.384293

In total I need 2057 printed values, because that is how many vectors I have. My question: is it possible to create a for loop or repeat loop that would do this operation for me? I tried myself but didn't get anywhere with it.

dput(head(sequence))
structure(c("function (nvec) ", "unlist(lapply(nvec, seq_len))"
), .Dim = c(2L, 1L), .Dimnames = list(c("1", "2"), ""), class = "noquote")

My attempt: I wanted to focus on the count function only and created this:

A <- matrix(0, 2, 4)

for (i in 1:2) {
  A[i] <- count(df[i,], 1)
}

What the loop does is correctly count the number of "a" in the first vector and then move on to the second one. It completely ignores the rest of the letters:

A
     [,1] [,2] [,3] [,4]
[1,]   27    0    0    0
[2,]   28    0    0    0

Additionally, I naively thought that adding a bunch of "i" everywhere would make it work:

s <- matrix(0, 1, 4)
s1 <- matrix(0, 1, 4)
s2 <- numeric(4)

for (i in 1:2) {
  s[i] <- count(df[i,],1)
  s1[i] <- s[i]/sum(s[i])
  s2[i] <- -sum(log(s1[i])*s1[i])
}

But that didn't get me anywhere either.

  • Can you share a piece of your data, perhaps with dput(head(mydata))? Commented Apr 30, 2018 at 20:48
  • "I tried myself but I didn't get nowhere with this." Perhaps you can give your best try. Commented Apr 30, 2018 at 20:50
  • I would love to but I have no idea how to. What do I do exactly to share it? Commented Apr 30, 2018 at 20:50
  • Copy-paste the output of dput(head(mydata))? Commented Apr 30, 2018 at 20:51
  • John Coleman, I am a maths undergrad taking one compulsory comp. stats module. I don't know how to program, and I think I did give it my best try. Commented Apr 30, 2018 at 20:51

2 Answers

If you don't need to save the counts and only need to print or save the calculation you show, this should work:

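entropy <- numeric(nrow(df)) # create the result vector before the loop

for (i in 1:nrow(df)) {
    v1 <- count(df[i,], 1)   # counts of a, c, g, t in row i
    v2 <- v1/sum(v1)         # relative frequencies
    v3 <- -sum(log(v2)*v2)   # Shannon entropy (natural log)
    print(v3)                # to print the value
    entropy[i] <- v3         # to save the value in the vector
}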
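The same calculation can also be sketched without an explicit loop. This is a minimal sketch rather than the code from the question: it swaps seqinr's count for base R's table so that it runs without the package, and the helper shannon_entropy and the data frame toy are hypothetical names made up for illustration. Because table only lists letters that actually occur, a row with a missing letter does not produce log(0).

```r
# Hypothetical helper: Shannon entropy (natural log) of one character vector.
# table() is used here as a base R stand-in for seqinr::count(x, 1).
shannon_entropy <- function(x) {
  p <- table(x) / length(x)  # relative frequencies of the observed letters only
  -sum(p * log(p))
}

# Toy stand-in for df: three rows of four letters each (made-up data).
toy <- data.frame(matrix(c("a", "c", "g", "t",
                           "a", "a", "a", "a",
                           "a", "a", "c", "c"),
                         nrow = 3, byrow = TRUE),
                  stringsAsFactors = FALSE)

# apply() with MARGIN = 1 passes each row to the helper as a character vector.
entropy <- apply(toy, 1, shannon_entropy)
print(entropy)  # log(4), 0 and log(2) for the three toy rows
```

With the real data, apply(df, 1, shannon_entropy) would return one value per row, so 2057 values in the question's case.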

The problem with the loop you show may be that the output of count is a table with 1 row and 4 columns, which you need to assign to a whole matrix row. In the assignment you wrote s[i] <- count(df[i,],1), which targets a single element, when it should be s[i,] <- count(df[i,],1).
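The difference between the two assignments can be seen on a small example. This is only a sketch under assumptions: count_bases is a hypothetical base R stand-in for count(df[i,], 1), and rows is made-up data.

```r
# Hypothetical stand-in for seqinr::count(x, 1): counts over a fixed alphabet.
count_bases <- function(x) table(factor(x, levels = c("a", "c", "g", "t")))

# Two made-up sequences standing in for df[1, ] and df[2, ].
rows <- list(c("a", "c", "g", "t", "a"),
             c("g", "g", "c", "t", "a"))

A_bad  <- matrix(0, 2, 4)
A_good <- matrix(0, 2, 4, dimnames = list(NULL, c("a", "c", "g", "t")))

for (i in 1:2) {
  counts <- as.vector(count_bases(rows[[i]]))
  # A_bad[i] addresses ONE cell, so R keeps only the first count (the "a"
  # count) and warns that the replacement has the wrong length.
  suppressWarnings(A_bad[i] <- counts)
  # A_good[i, ] addresses the whole row, so all four counts are stored.
  A_good[i, ] <- counts
}

print(A_bad)   # only the first column is filled
print(A_good)  # one full row of counts per sequence
```

The matrix also needs one row per sequence, so for the full data it would be created with nrow(df) rows rather than 1 or 2.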



Would this work for you:

df <- data.frame (x = c("a","c","g","g","g"), 
                  y = c("g","c","a","a","g"), 
                  z = c("g","t","t","a","g"),stringsAsFactors=FALSE)


A <- sapply(1:nrow(df), FUN=function(i){count(df[i,],1)})

> A
  [,1] [,2] [,3] [,4] [,5]
a    1    0    1    2    0
c    0    2    0    0    0
g    2    0    1    1    3
t    0    1    1    0    0
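From a count matrix like A, where each column holds the counts for one row of df, the entropies could then be computed column by column. The sketch below rebuilds A by hand from the output above so that it runs without seqinr. The zero counts need care, because 0 * log(0) is NaN in R, so those terms are set to 0 explicitly, which is an assumption that follows the usual convention for Shannon entropy.

```r
# Count matrix copied from the printed output above (filled column by column).
A <- matrix(c(1, 0, 2, 0,
              0, 2, 0, 1,
              1, 0, 1, 1,
              2, 0, 1, 0,
              0, 0, 3, 0),
            nrow = 4, dimnames = list(c("a", "c", "g", "t"), NULL))

# Turn each column of counts into relative frequencies.
P <- sweep(A, 2, colSums(A), "/")

# p * log(p), with the terms for p == 0 replaced by 0 instead of NaN.
terms <- ifelse(P > 0, P * log(P), 0)

# One Shannon entropy (natural log) per column, i.e. per row of df.
entropy <- -colSums(terms)
print(entropy)
```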

1 Comment

Were you trying to do this? xtabs(rep(1,prod(dim(df)))~values+ind,stack(df))
