I have an XML file that I'm trying to process through spark-shell using Scala. I am stuck at the point where I need to read the elements of an Array[String] using Scala's XML classes. This is what I have so far:
scala> val fileRead = sc.textFile("source_file")
fileRead: org.apache.spark.rdd.RDD[String] = source_file MapPartitionsRDD[8] at textFile at <console>:21
scala> val strLines = fileRead.map(x => x.toString)
strLines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[9] at map at <console>:23
scala> val fltrLines = strLines.filter(_.contains("<record column1="))
fltrLines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[10] at filter at <console>:25
scala> fltrLines.take(5)
res1: Array[String] = Array("<record column1="1" column2="1" column3="5" column4="2010-11-02T18:59:01.140" />", "<record column2=....
I need to read this element of the Array[String]:
"<record column1="1" column2="1" column3="5" column4="2010-11-02T18:59:01.140" />"
as XML, so that I can use Scala's Elem and NodeSeq classes to extract the data. So I want to do something like:
val xmlLines = fltrLines.....somehow get the value at the first index of the Array[String]
and then use xmlLines.attributes, etc.
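For illustration, here is a minimal sketch of one way this might be done, assuming the scala.xml classes are on the spark-shell classpath and that every filtered line is a complete, well-formed element like the sample above. The helper name parseRecord is hypothetical and not part of any library.

```scala
import scala.xml.{Elem, XML}

// One sample line, copied from the take(5) output above.
val line = """<record column1="1" column2="1" column3="5" column4="2010-11-02T18:59:01.140" />"""

// Parse the string into an Elem (throws if the line is not well-formed XML).
val elem: Elem = XML.loadString(line)

// Read a single attribute by name; the result is "" when the attribute is absent.
val column1: String = (elem \ "@column1").text

// Or collect all attributes of the element into a Map.
val attrs: Map[String, String] = elem.attributes.asAttrMap

// Hypothetical helper that turns one line into a plain attribute map,
// so that only ordinary Scala collections flow through the RDD.
def parseRecord(s: String): Map[String, String] =
  XML.loadString(s).attributes.asAttrMap
```

In spark-shell, the first element could probably be obtained with fltrLines.first() or fltrLines.take(5)(0) and passed to XML.loadString, while fltrLines.map(parseRecord) would apply the same parsing to every line.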