3

How to flatten an Array of Strings into multiple rows of a DataFrame in Spark 2.2.0?

Input Row ["foo", "bar"]

import spark.implicits._  // for .toDF on a local Seq (pre-imported in spark-shell)

val inputDS = Seq("""["foo", "bar"]""").toDF

inputDS.printSchema()

root
 |-- value: string (nullable = true)

Input Dataset inputDS

inputDS.show(false)

+--------------+
|value         |
+--------------+
|["foo", "bar"]|
+--------------+

Expected output dataset outputDS

+-----+
|value|
+-----+
|"foo"|
|"bar"|
+-----+

I tried the explode function as below, but it didn't quite work:

import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType}

inputDS.select(explode(from_json(col("value"), ArrayType(StringType))))

and I get the following error

org.apache.spark.sql.AnalysisException: cannot resolve 'jsontostructs(`value`)' due to data type mismatch: Input schema array<string> must be a struct or an array of structs

I also tried the following:

inputDS.select(explode(col("value")))

And I get the following error

org.apache.spark.sql.AnalysisException: cannot resolve 'explode(`value`)' due to data type mismatch: input to function explode should be array or map type, not StringType
4 Comments
  • If you simply have an array of strings then you do not need the from_json part. Simply try inputDS.select(explode(col("value"))). Commented Oct 9, 2017 at 7:18
  • Tried that earlier and tried it again just now. I get the following error org.apache.spark.sql.AnalysisException: cannot resolve 'explode(value)' due to data type mismatch: input to function explode should be array or map type, not StringType Commented Oct 9, 2017 at 7:21
  • Looks like you do not actually have an Array, but a string. An option would be to look into the split function and use that together with explode. Can you check the input again and update the question? Commented Oct 9, 2017 at 7:27
  • Updated, and that is exactly what I have; those are the exact errors I get. Commented Oct 9, 2017 at 7:32

3 Answers

7

The exception is thrown by:

from_json(col("value"), ArrayType(StringType))

not by explode. Specifically:

Input schema array<string> must be a struct or an array of structs.

You can work around it by stripping the brackets and splitting the string (triple-quoted so the regex reaches Spark SQL as ,\s+):

inputDS.selectExpr(
  """split(substring(value, 2, length(value) - 2), ',\\s+') as value""")

and explode the output.
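
Putting it together, a minimal sketch of the full pipeline on Spark 2.2, reusing the question's inputDS (note the split keeps the embedded quotes, matching the expected output above):

import org.apache.spark.sql.functions.{col, explode}

// Strip the leading '[' and trailing ']', split on the comma separators, then explode.
val outputDS = inputDS
  .selectExpr("""split(substring(value, 2, length(value) - 2), ',\\s+') as value""")
  .select(explode(col("value")).as("value"))

outputDS.show()
// +-----+
// |value|
// +-----+
// |"foo"|
// |"bar"|
// +-----+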

0

The issue above is fixed in Spark 2.4.0 (https://jira.apache.org/jira/browse/SPARK-24391), so you can use from_json($"column_nm", ArrayType(StringType)) without any problems.
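
For example, on Spark 2.4.0 or later, a minimal sketch reusing the question's inputDS; note that from_json actually parses the JSON, so the quotes are stripped from the elements:

import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType}

// Spark 2.4.0+ only; on 2.2.x this raises the AnalysisException shown above.
val outputDS = inputDS.select(
  explode(from_json(col("value"), ArrayType(StringType))).as("value"))

outputDS.show()
// +-----+
// |value|
// +-----+
// |  foo|
// |  bar|
// +-----+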

-2

You can achieve this simply using flatMap.

import spark.implicits._  // required for toDS() (pre-imported in spark-shell)

val input = sc.parallelize(Array("foo", "bar")).toDS()
val out = input.flatMap(x => x.split(","))
out.collect().foreach(println)
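
For reference, a hedged sketch of how the same flatMap idea could be adapted to the question's actual single-string input (assuming the elements themselves never contain commas; inputDS as defined in the question):

import spark.implicits._

// Strip the surrounding brackets, then split on the commas between elements.
val out2 = inputDS.as[String].flatMap { s =>
  s.stripPrefix("[").stripSuffix("]").split(",\\s*").map(_.trim)
}
out2.collect().foreach(println)
// prints "foo" and "bar" (quotes retained, as in the expected output)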

2 Comments

sorry this is not what I am looking for since this is not how I have it in my code. The question here is a simple version of my bigger problem.
I did not give you a down vote. I am not sure who did
