Extracting elements from XML records using Spark / Scala

Question

I'm trying to extract elements from XML records, where each XML file contains many records. Below are the (modified) code and the sample XML files I'm using.

I'm expecting an array of strings where each element has the form "user:id", but the result is ":". I expected XML.loadString to parse each file into separate XML records, so the two sample files would yield 4 records. As it is, I get two.

After adding a println(d) after the call to details.next, what I get is the entire string that represents the file, which is likely why the getId and getUser functions are not returning anything.

Am I handling the load incorrectly?

import org.apache.spark.{SparkConf, SparkContext}
import scala.xml._
import scala.collection.mutable.ArrayBuffer

object Details {

    def getDetails(xmlstring: String): Iterator[Node] = {
        val nodes = XML.loadString(xmlstring)
        nodes.toIterator
    }

    def getId(detail: Node): String = {
        (detail \ "id").text
    }

    def getUser(detail: Node): String = {
        (detail \ "user").text
    }

    def getDetailList(details: Iterator[Node]): Array[String] = {
        var list = ArrayBuffer[String]()
        while (details.hasNext) {
            val d = details.next
            val user = getUser(d)
            val id = getId(d)
            val formattedText = user + ":" + id
            list += formattedText
        }
        list.toArray
    }

    def main(args: Array[String]) {

        val conf = new SparkConf().setAppName("Details")
        val sc: SparkContext = new SparkContext(conf)

        val lines = sc.wholeTextFiles("file:///path/to/files/")
        val xmlStrings = lines.map(line => line._2)
        val detailsRecords = xmlStrings.map(getDetails)
        val detailsList = detailsRecords.map(getDetailList)

        sc.stop()
    }
}

And two sample files...

test.xml

<details>
  <detail>
    <user>Dan</user>
    <id>5555</id>
  </detail>
  <detail>
    <user>Mike</user>
    <id>6666</id>
  </detail>
</details>

test2.xml

<details>
  <detail>
    <user>John</user>
    <id>1234</id>
  </detail>
  <detail>
    <user>Joe</user>
    <id>5678</id>
  </detail>
</details>

illak zapata · Accepted Answer · 2018-10-09 02:17:23Z

You should use the spark-xml library ("XML for Spark"), which provides the com.databricks.spark.xml data source.

With this library you can read all your xml files like this:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

val df = sqlContext.read
   .format("com.databricks.spark.xml")
   .option("rowTag", "detail")
   .load("/home/path-with-xml-files")

This generates a DataFrame with one row per <detail> element. Its contents look like this:

+----+----+
|  id|user|
+----+----+
|5555| Dan|
|6666|Mike|
|1234|John|
|5678| Joe|
+----+----+

Then get an array from this DF:

val id_users_array = df.collect

This array has the type:

id_users_array: Array[org.apache.spark.sql.Row] = Array([5555,Dan], [6666,Mike], [1234,John], [5678,Joe])
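
To get the "user:id" strings the question asks for, one option is to build them in the DataFrame before collecting. The following is a minimal sketch, assuming the DataFrame has the columns user and id as shown above; the names UserIdStrings and fromDataFrame are made up for illustration.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, concat_ws}

object UserIdStrings {
  // Assumption: df has the columns "user" and "id", as in the DataFrame above.
  // Returns one "user:id" string per row.
  def fromDataFrame(df: DataFrame): Array[String] =
    df.select(
        // Cast id explicitly, since it may have been inferred as a numeric type.
        concat_ws(":", col("user"), col("id").cast("string")).as("user_id")
      )
      .collect()
      .map(_.getString(0))
}
```

Calling UserIdStrings.fromDataFrame(df) brings the strings to the driver, so like df.collect it is only suitable when the result is small.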

If you want to print only the ids:

id_users_array.map(r => r.get(0)).foreach(println)

which, for the array shown above, outputs:

5555
6666
1234
5678

Hope this helps.

tpysz5n · Accepted Answer · 2019-03-06 00:09:39Z

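I'm about 4 months late, but I think I have just the answer for you.

The problem lies in the getDetails() function. XML.loadString returns the root element, <details>, so the iterator yields only that single node, and (detail \ "id") finds nothing because \ looks only at direct children. You have to tell Scala which elements count as a "node", which is <detail> in this case. So just modify your code as below: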

  def getDetails(xmlstring: String): Iterator[Node] = {
    val nodes = XML.loadString(xmlstring) \\ "detail"
    nodes.toIterator
  }

Appending \\ "detail" at the end of XML.loadString() is all you need to get the code working as you expect.
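
The difference can be checked without Spark. The following is a small sketch using only scala.xml (assuming the scala-xml module is on the classpath, which it is in a Spark project); the object DetailParsing and its method names are made up for illustration, and the sample string is test.xml from the question.

```scala
import scala.xml.XML

object DetailParsing {
  // Sample document copied from test.xml in the question.
  val sample: String =
    """<details>
      |  <detail>
      |    <user>Dan</user>
      |    <id>5555</id>
      |  </detail>
      |  <detail>
      |    <user>Mike</user>
      |    <id>6666</id>
      |  </detail>
      |</details>""".stripMargin

  // Mirrors the original code: iterating the loaded document yields only the
  // root <details> element, whose direct children are not <user> or <id>.
  def withoutSelector(xmlString: String): List[String] = {
    val root = XML.loadString(xmlString)
    root.toList.map(d => (d \ "user").text + ":" + (d \ "id").text)
  }

  // Mirrors the fix: select every <detail> element first, then read its children.
  def userIds(xmlString: String): List[String] = {
    val root = XML.loadString(xmlString)
    (root \\ "detail").toList.map(d => (d \ "user").text + ":" + (d \ "id").text)
  }
}
```

Here withoutSelector(sample) gives a single ":" entry, while userIds(sample) gives "Dan:5555" and "Mike:6666". With an RDD of file contents like xmlStrings in the question, xmlStrings.flatMap(DetailParsing.userIds) should then give one string per <detail> record rather than one array per file.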

Cheers,
