I map data read from a text file. The text file is supposed to have 5 columns, e.g.:

29000000    1   0   2013    1   single-sex
29000000    1   0   2013    1   education
29000000    1   0   2013    1   and
29000000    1   0   2013    1   the
29000000    1   0   2013    1   brain

In my process I need only the values in the 0th and 5th columns. To get those I wrote the following:

val emp = 
  sc.textFile("\\.txt")
    .map{line => val s = line.split("\t"); (s(5),s(0))}

However, the 5th column is sometimes missing for some rows, and then I get:

15/10/12 17:19:33 INFO TaskSetManager: Lost task 27.0 in stage 0.0 (TID 27) on executor localhost: java.lang.ArrayIndexOutOfBoundsException (5)

So in my mapping, how should I write an if condition that checks whether s(5) exists?

  • What does the business logic of your application require you to do in such a case? I mean, should you ignore those records or should you behave in a different way? Commented Oct 12, 2015 at 15:39
  • You seem to know that the array is zero-based, and since there are five items, the indices are 0, 1, 2, 3, 4. Index 5 would mean a sixth column. Commented Oct 12, 2015 at 15:45
  • There are 6 fields in my post. Beryllium's way is working well. Commented Oct 14, 2015 at 16:00

1 Answer

You can add a filter() in between:

val rdd = 
  sc.textFile("...").map(_.split("\t")).filter(_.size > 5).map(a => (a(0), a(5)))
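One thing to keep in mind with a size check like this: String.split with a single argument drops trailing empty strings, so a line whose last column is present but empty yields a shorter array and would be filtered out. A minimal sketch in plain Scala (no Spark needed), where the sample lines and the names SplitDemo, full and emptyLast are made up for illustration:

```scala
object SplitDemo {
  // Hypothetical tab-separated sample lines (not taken from the real file)
  val full = "29000000\t1\t0\t2013\t1\tbrain"
  val emptyLast = "29000000\t1\t0\t2013\t1\t" // sixth column present but empty

  // Default split: trailing empty strings are removed from the result
  val fullFields: Array[String] = full.split("\t")
  val droppedFields: Array[String] = emptyLast.split("\t")

  // A negative limit keeps trailing empty strings
  val keptFields: Array[String] = emptyLast.split("\t", -1)

  def main(args: Array[String]): Unit = {
    println(fullFields.length)    // 6
    println(droppedFields.length) // 5
    println(keptFields.length)    // 6
  }
}
```

Whether such rows should be kept or dropped depends on the data, so passing -1 is only one possible choice.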

Another option using flatMap (combined with extraction "on-the-fly"):

val rdd = sc.textFile("...").flatMap { l => 
  l.split("\t") match {
    case Array(x: String, _, _, _, _, y: String) => Some((x, y))
    case _ => None 
  }
}
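Note that an Array pattern with six positions matches only arrays of exactly six elements, whereas the filter variant accepts six or more. If longer rows should also be accepted, a trailing _* can absorb the extra fields. A small sketch using local collections instead of an RDD; the object and method names are hypothetical:

```scala
object PatternDemo {
  // Matches only when the line has exactly six tab-separated fields
  def exactlySix(line: String): Option[(String, String)] =
    line.split("\t") match {
      case Array(x, _, _, _, _, y) => Some((x, y))
      case _                       => None
    }

  // Matches when the line has six or more fields; _* absorbs the rest
  def atLeastSix(line: String): Option[(String, String)] =
    line.split("\t") match {
      case Array(x, _, _, _, _, y, _*) => Some((x, y))
      case _                           => None
    }

  def main(args: Array[String]): Unit = {
    // Made-up lines with six, seven and two fields
    val lines = List("a\tb\tc\td\te\tf", "a\tb\tc\td\te\tf\tg", "a\tb")
    println(lines.flatMap(l => exactlySix(l)))
    println(lines.flatMap(l => atLeastSix(l)))
  }
}
```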

The condition can be expressed with a guard as well (together with pattern matching on the class Array[String]):

val rdd = sc.textFile("...").flatMap { l => 
  l.split("\t") match {
    case a: Array[String] if a.size > 5 => Some((a(0), a(5)))
    // Only one column, provide a default for the other
    case a: Array[String] if a.size == 1 => Some((a(0), "default value"))
    // Ignore everything else
    case _ => None 
  }
}

With flatMap you can handle any number of differently shaped lines as separate cases, and drop the ones that match none of them.
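As a variation on the same idea (not part of the approaches above), the index check can be left to lift, which returns an Option instead of throwing when the index is out of range. A minimal sketch, where LiftDemo and firstAndSixth are hypothetical names:

```scala
object LiftDemo {
  // Pairs field 0 with field 5; yields None if either index is missing
  def firstAndSixth(line: String): Option[(String, String)] = {
    val fields = line.split("\t")
    for {
      x <- fields.lift(0) // Some(value) if index 0 exists, else None
      y <- fields.lift(5) // Some(value) if index 5 exists, else None
    } yield (x, y)
  }

  def main(args: Array[String]): Unit = {
    println(firstAndSixth("29000000\t1\t0\t2013\t1\tbrain"))
    println(firstAndSixth("29000000\t1"))
  }
}
```

Such a helper could presumably be used inside flatMap in the same way as the match expressions, for example flatMap(l => firstAndSixth(l)).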
