Spark SQL's DataFrameReader supports the JSON Lines text format (aka newline-delimited JSON), where each line is a valid JSON value.
You can use the json operator to read such a dataset.
// on command line
$ cat subjects.jsonl
{ "name" : "James", "subjects" : [ "english", "french", "botany" ] }
{ "name" : "neo", "subjects" : [ "english", "physics" ] }
{ "name" : "john", "subjects" : [ "spanish", "mathematics" ] }
// in spark-shell
scala> val subjects = spark.read.json("subjects.jsonl")
subjects: org.apache.spark.sql.DataFrame = [name: string, subjects: array<string>]
scala> subjects.show(truncate = false)
+-----+-------------------------+
|name |subjects |
+-----+-------------------------+
|James|[english, french, botany]|
|neo |[english, physics] |
|john |[spanish, mathematics] |
+-----+-------------------------+
scala> subjects.printSchema
root
|-- name: string (nullable = true)
|-- subjects: array (nullable = true)
| |-- element: string (containsNull = true)
With that, you should have a look at the functions object, where you can find collection functions that deal with array-based columns, e.g. array_contains or explode.
That's what you can find in the answer from @Vidya.
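For completeness, here is a minimal sketch of that untyped, explode-based approach, assuming the subjects DataFrame from the spark-shell session above:

import org.apache.spark.sql.functions._

subjects
  .select($"name", explode($"subjects") as "subject") // one row per (name, subject) pair
  .where(lower($"subject") === "english")             // keep only english
  .show

It gives the same two rows (James and neo), with proper column names and no extra typing ceremony.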
What is missing is my beloved Dataset.flatMap that, given the subjects Dataset, could be used as follows:
scala> subjects
.as[(String, Seq[String])] // convert to Dataset[(String, Seq[String])] for more type-safety
.flatMap { case (student, subjects) => subjects.map(s => (student, s)) } // typed expand
  .filter(_._2.toLowerCase == "english") // keep only english subjects
.show
+-----+-------+
| _1| _2|
+-----+-------+
|James|english|
| neo|english|
+-----+-------+
That, however, doesn't look as nice as its for-comprehension version.
val subjectsDF = subjects.as[(String, Seq[String])]
val englishStudents = for {
(student, ss) <- subjectsDF // flatMap
subject <- ss // map
if subject.toLowerCase == "english"
} yield (student, subject)
scala> englishStudents.show
+-----+-------+
| _1| _2|
+-----+-------+
|James|english|
| neo|english|
+-----+-------+
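The default tuple column names _1 and _2 can be replaced using toDF. A one-liner, assuming the englishStudents Dataset from above:

englishStudents.toDF("name", "subject").show

toDF on a Dataset of tuples returns a DataFrame with the given column names, which makes the output (and any downstream SQL) more readable.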
Moreover, as of Spark 2.2 (soon to be released), you've got the DataFrameReader.json operator that can read a Dataset[String] directly.
scala> spark.version
res0: String = 2.3.0-SNAPSHOT
import org.apache.spark.sql.Dataset
val subjects: Dataset[String] = Seq(
"""{ "name" : "James", "subjects" : [ "english", "french", "botany" ] }""",
"""{ "name" : "neo", "subjects" : [ "english", "physics" ] }""",
"""{ "name" : "john", "subjects" : [ "spanish", "mathematics" ]}""").toDS
scala> spark.read.json(subjects).show(truncate = false)
+-----+-------------------------+
|name |subjects |
+-----+-------------------------+
|James|[english, french, botany]|
|neo |[english, physics] |
|john |[spanish, mathematics] |
+-----+-------------------------+
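Schema inference requires an extra pass over the data, so when the layout is known up front you may want to supply the schema explicitly via DataFrameReader.schema. A sketch, assuming the subjects Dataset[String] from above:

import org.apache.spark.sql.types._

// Declare the schema instead of inferring it
val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("subjects", ArrayType(StringType))))

spark.read.schema(schema).json(subjects).show(truncate = false)

This skips the inference pass entirely, which matters when reading large JSON datasets repeatedly.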