3

How to flatten an Array of Strings into multiple rows of a DataFrame in Spark 2.2.0?

Input Row ["foo", "bar"]

import spark.implicits._  // for .toDF on a local Seq (pre-imported in spark-shell)

val inputDS = Seq("""["foo", "bar"]""").toDF

inputDS.printSchema()

root
 |-- value: string (nullable = true)

Input Dataset inputDS

inputDS.show(false)

+--------------+
|value         |
+--------------+
|["foo", "bar"]|
+--------------+

Expected output dataset outputDS

+-----+
|value|
+-----+
|"foo"|
|"bar"|
+-----+

I tried the explode function as below, but it didn't quite work:

import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType}

inputDS.select(explode(from_json(col("value"), ArrayType(StringType))))

and I get the following error

org.apache.spark.sql.AnalysisException: cannot resolve 'jsontostructs(`value`)' due to data type mismatch: Input schema array<string> must be a struct or an array of structs

I also tried the following:

inputDS.select(explode(col("value")))

And I get the following error

org.apache.spark.sql.AnalysisException: cannot resolve 'explode(`value`)' due to data type mismatch: input to function explode should be array or map type, not StringType
4 Comments
  • If you simply have an array of strings then you do not need the from_json part. Simply try inputDS.select(explode(col("value"))). Commented Oct 9, 2017 at 7:18
  • Tried that earlier and tried it again just now. I get the following error org.apache.spark.sql.AnalysisException: cannot resolve 'explode(value)' due to data type mismatch: input to function explode should be array or map type, not StringType Commented Oct 9, 2017 at 7:21
  • Looks like you do not actually have an Array, but a string. An option would be to look into the split function and use that together with explode. Can you check the input again and update the question? Commented Oct 9, 2017 at 7:27
  • Updated, and that is exactly what I have; those are the exact errors I get. Commented Oct 9, 2017 at 7:32

3 Answers

7

The exception is thrown by:

from_json(col("value"), ArrayType(StringType))

not by explode. Specifically:

Input schema array<string> must be a struct or an array of structs.

You can work around it by stripping the brackets and splitting the string (triple-quoted so the regex reaches Spark SQL as ,\s+):

inputDS.selectExpr(
  """split(substring(value, 2, length(value) - 2), ',\\s+') as value""")

and explode the output.
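
Putting it together, a minimal sketch of the full pipeline on Spark 2.2, reusing the question's inputDS (note the split keeps the embedded quotes, matching the expected output above):

import org.apache.spark.sql.functions.{col, explode}

// Strip the leading '[' and trailing ']', split on the comma separators, then explode.
val outputDS = inputDS
  .selectExpr("""split(substring(value, 2, length(value) - 2), ',\\s+') as value""")
  .select(explode(col("value")).as("value"))

outputDS.show()
// +-----+
// |value|
// +-----+
// |"foo"|
// |"bar"|
// +-----+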

0

The issue above is fixed in Spark 2.4.0 (https://jira.apache.org/jira/browse/SPARK-24391), so you can use from_json($"column_nm", ArrayType(StringType)) without any problems.
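
For example, on Spark 2.4.0 or later, a minimal sketch reusing the question's inputDS; note that from_json actually parses the JSON, so the quotes are stripped from the elements:

import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType}

// Spark 2.4.0+ only; on 2.2.x this raises the AnalysisException shown above.
val outputDS = inputDS.select(
  explode(from_json(col("value"), ArrayType(StringType))).as("value"))

outputDS.show()
// +-----+
// |value|
// +-----+
// |  foo|
// |  bar|
// +-----+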

-2

You can achieve this simply using flatMap.

import spark.implicits._  // required for toDS() (pre-imported in spark-shell)

val input = sc.parallelize(Array("foo", "bar")).toDS()
val out = input.flatMap(x => x.split(","))
out.collect().foreach(println)
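
For reference, a hedged sketch of how the same flatMap idea could be adapted to the question's actual single-string input (assuming the elements themselves never contain commas; inputDS as defined in the question):

import spark.implicits._

// Strip the surrounding brackets, then split on the commas between elements.
val out2 = inputDS.as[String].flatMap { s =>
  s.stripPrefix("[").stripSuffix("]").split(",\\s*").map(_.trim)
}
out2.collect().foreach(println)
// prints "foo" and "bar" (quotes retained, as in the expected output)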

2 Comments

sorry this is not what I am looking for since this is not how I have it in my code. The question here is a simple version of my bigger problem.
I did not give you a down vote. I am not sure who did
