2

I have a "hit list" of genes in a matrix. Each row is a hit, and the format is "chromosome(character) start(a number) stop(a number)." I would like to see which of these hits overlap with genes in the fly genome, which is a matrix with the format "chromosome start stop gene"

I have the following function that works (prints a list of genes from column 4 of dmelGenome):

geneListBuild <- function(dmelGenome='', hitList='', binSize='', saveGeneList='')

{
genomeColumns <- c('chr', 'start', 'stop', 'gene')
genome <- read.table(dmelGenome, header=FALSE, col.names = genomeColumns)

chr <- genome[,1]
startAdjust <- genome[,2] - binSize
stopAdjust <- genome[,3] + binSize
gene <- genome[,4]

genome <- data.frame(chr, startAdjust, stopAdjust, gene)

hits <- read.table(hitList, header=TRUE)

chrHits <- hits[hits$chr == "chr3R",]
chrGenome <- genome[genome$chr == "chr3R",]

genes <- c()

for(i in 1:length(chrHits[,1])) 
{
    for(j in 1:length(chrGenome[,1]))   
    {
        if( chrHits[i,2] >= chrGenome[j,2]  &&  chrHits[i,3] <= chrGenome[j,3] )
        {
            print(chrGenome[j,4])
        }
    }
}

genes <- unique(genes[is.finite(genes)])

print(genes)

fileConn<-file(saveGeneList) 
write(genes, fileConn) 
close(fileConn) 

}

however, when I substitute print() with:

genes[j] <- chrGenome[j,4]

R returns a vector that has some values that are present in chrGenome[,1]. I don't know how it chooses these values, because they aren't in rows that seem to fulfill the if statement. I think it's an indexing issue?

Also I'm sure that there is a more efficient way of doing this. I'm new to R, so my code isn't very efficient.

This is similar to the "writing the results from a nested loop into another vector in R," but I couldn't fix it with the information in that thread.

Thanks.

2
  • 1
    This is a big chunk of code that we can't reproduce (no sample data), so it's going to be really tricky to answer. Please can you (1) provide data and (2) try and more accurately locate the source of the problem. Commented Oct 25, 2011 at 14:46
  • 1
    This is very, very similar (making me wonder if user#### is crosspoting) to a question that was definitely crossposted to r-help and the BioC lists. There is good support for such activities in BioConductor: stat.ethz.ch/pipermail/bioconductor/2011-October/041776.html Commented Oct 25, 2011 at 16:01

1 Answer 1

4

I believe the inner loop could be replaced with:

gene.in <- ifelse( chrHits[i,2] >= chrGenome[,2] &  chrHits[i,3] <= chrGenome[,3], 
    TRUE, FALSE)

Then you can use that logical vector to select what you want. Doing

which(gene.in)

might also be of use to you.

Sign up to request clarification or add additional context in comments.

2 Comments

Or even gene.in <- chrHits[i,2] >= chrGenome[,2] & chrHits[i,3] <= chrGenome[,3].
Thanks - I tried Richie's modification of Patrick's suggestion, but it only returned one value, and I know that there are more than one hits since I can see them when I use print(). I'm crunched for time right now, and am just copying the printed list to a txt file. I'm not crossposting, but thanks for pointing me to the bioconductor thread! I'll check it out. Richie - the problem occurs if I remove "print(chrGenome[j,4])" and try to send the output to an object. I believe that I am indexing incorrectly.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.