I am working on creating a k-mer frequency counter (similar to word count in Hadoop) written in Scala. I'm fairly new to Scala, but I have some programming experience.

The input is a text file containing a gene sequence and my task is to get the frequency of each k-mer where k is some specified length of the sequence.

For example, the sequence AGCTTTC has three 5-mers (AGCTT, GCTTT, CTTTC).

I've parsed the input and joined it into one huge string containing the entire sequence, because the newlines throw off the k-mer counting: the end of one line's sequence should still form a k-mer with the beginning of the next line's sequence.

Now I am trying to write a function that will generate a list of maps, List[Map[String, Int]], with which it should be easy to use Scala's groupBy function to get the counts of the common k-mers.

import scala.io.Source

object Main {
  def main(args: Array[String]) {

    // Get all of the lines from the input file
    val input = Source.fromFile("input.txt").getLines.toArray

    // Create one huge string which contains all the lines but the first
    val lines = input.tail.mkString.replace("\n","")

    val mappedKmers: List[Map[String,Int]] = getMappedKmers(5, lines)

  }

  def getMappedKmers(k: Int, seq: String): List[Map[String, Int]] = {
    for (i <- 0 until seq.length - k) {
      Map(seq.substring(i, i+k), 1) // Map the k-mer to a count of 1
    }
  }
}

Couple of questions:

  • How to create/generate List[Map[String,Int]]?
  • How would you do it?

Any help and/or advice is definitely appreciated!

  • Why do you map every k-mer to 1? Each element could be just the k-mer itself, so the return type can be just List[String]. So my suggested code is simply seq sliding k toList. Commented Oct 6, 2014 at 2:27
  • Good point, I was actually trying to do something like that just a minute ago. The 1 was there because I was attempting to adapt this code to how word count works in a Hadoop MapReduce job, where each key gets mapped to a value of 1 and the reduce function accumulates totals for each key (k-mer in this case). Commented Oct 6, 2014 at 2:33

1 Answer

You're pretty close—there are three fairly minor problems with your code.

The first is that for (i <- whatever) foo(i) is syntactic sugar for whatever.foreach(i => foo(i)), which means you're not actually doing anything with the contents of whatever. What you want is for (i <- whatever) yield foo(i), which is sugar for whatever.map(i => foo(i)) and returns the transformed collection.

The second issue is that 0 until seq.length - k is a Range, not a List, so even once you've added the yield, the result still won't line up with the declared return type.
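To make the first two points concrete, here is a minimal sketch using a toy range rather than your sequence. The object and value names are made up for illustration.

```scala
object ForYieldSketch {
  // No `yield`: this desugars to foreach, so the body's values are thrown away
  // and the whole expression has type Unit.
  val discarded: Unit = for (i <- 0 until 3) i * 2

  // With `yield`: this desugars to map, so the values are collected.
  // Mapping over a Range gives an IndexedSeq, not a List.
  val collected: IndexedSeq[Int] = for (i <- 0 until 3) yield i * 2

  // An explicit conversion is needed when a List is required.
  val asList: List[Int] = collected.toList
}
```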

The third issue is that Map(k, v) tries to create a map with two key-value pairs, k and v. You want Map(k -> v) or Map((k, v)), either of which is explicit about the fact that you have a single argument pair.
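As a small illustration of the two accepted spellings (the object and value names here are hypothetical):

```scala
object MapPairSketch {
  // Both forms pass exactly one (key, value) tuple to Map.apply.
  val withArrow: Map[String, Int] = Map("AGCTT" -> 1)
  val withTuple: Map[String, Int] = Map(("AGCTT", 1))

  // Several entries are simply several tuple arguments.
  val twoEntries: Map[String, Int] = Map("AGCTT" -> 1, "GCTTT" -> 1)
}
```

The arrow form is the more common one to see in Scala code, since `a -> b` is just another way of writing the tuple `(a, b)`.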

So the following should work:

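def getMappedKmers(k: Int, seq: String): IndexedSeq[Map[String, Int]] = {
  // `to` is inclusive, so the last window (starting at seq.length - k) is kept;
  // `until` would stop one short and drop the final k-mer.
  for (i <- 0 to seq.length - k) yield {
    Map(seq.substring(i, i + k) -> 1) // Map the k-mer to a count of 1
  }
}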

You could also convert either the range or the entire result to a list with .toList if you'd prefer a list at the end.
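If you do want the List[Map[String, Int]] shape and a MapReduce-style total afterwards, a rough sketch could look like the following. The names getMappedKmersList and sumCounts are hypothetical. The range uses the inclusive `to` on the assumption that the final k-mer should be counted (for AGCTTTC and k = 5 that is CTTTC).

```scala
object KmerMapsSketch {
  // Hypothetical variant of getMappedKmers that returns a List.
  // `0 to seq.length - k` is inclusive, so the last window is included;
  // if seq is shorter than k the range is empty and so is the result.
  def getMappedKmersList(k: Int, seq: String): List[Map[String, Int]] =
    (for (i <- 0 to seq.length - k) yield Map(seq.substring(i, i + k) -> 1)).toList

  // Hypothetical "reduce" step: flatten the single-entry maps into pairs,
  // group the pairs by k-mer, and sum the 1s in each group.
  def sumCounts(maps: List[Map[String, Int]]): Map[String, Int] =
    maps.flatten
      .groupBy { case (kmer, _) => kmer }
      .map { case (kmer, pairs) => kmer -> pairs.map(_._2).sum }
}
```

For example, KmerMapsSketch.sumCounts(KmerMapsSketch.getMappedKmersList(2, "AAAT")) gives a map in which AA has count 2 and AT has count 1.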

It's worth noting, by the way, that the sliding method on Seq does exactly what you want:

scala> "AGCTTTC".sliding(5).foreach(println)
AGCTT
GCTTT
CTTTC

I'd definitely suggest something like "AGCTTTC".sliding(5).toList.groupBy(identity) for real code; the size of each resulting group is then the count for that k-mer.
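Putting that together, one possible sketch of a complete counter is below. The name countKmers is hypothetical, and the length guard is my own addition: sliding returns a single shorter window when the input is shorter than the window size, which you probably don't want counted as a k-mer.

```scala
object KmerCountSketch {
  // Hypothetical helper: count how often each k-mer occurs in seq.
  def countKmers(k: Int, seq: String): Map[String, Int] =
    if (seq.length < k) Map.empty // guard against a partial window
    else
      seq.sliding(k)          // every window of length k, in order
        .toList
        .groupBy(identity)    // k-mer -> list of its occurrences
        .map { case (kmer, occurrences) => kmer -> occurrences.size }
}
```

With the string built in your main method, the call would be something like KmerCountSketch.countKmers(5, lines).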
