
I need to define a test sample with an ArrayType column for Spark to read. Here is what the schema of the data looks like:

 |-- data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: integer (nullable = true)
 |    |    |-- stat: float (nullable = true)
 |-- naming: string (nullable = true)
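
For reference, that printed schema corresponds to a definition like this using Spark's types API (a sketch of what I pass to the reader):

```scala
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("data", ArrayType(StructType(Seq(
    StructField("id", IntegerType),
    StructField("stat", FloatType)
  )))),
  StructField("naming", StringType)
))
```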

My current definition of the data field yields null values for all rows, so how should I structure this data in a CSV file?

Here is what my CSV file's structure looks like now:

"data1_id","data1_stat","data2_id","data2_stat","data3_id","data3_stat","naming"
"1","0.76","2","0.55","3","0.16","Default1"
"1","0.2","2","0.41","3","0.89","Default2"
"1","0.96","2","0.12","3","0.4","Default3"
"1","0.28","2","0.15","3","0.31","Default4"
"1","0.84","2","0.41","3","0.15","Default5"

When I call show on the input DataFrame I get this result:

+-------+-----------+
|data   |naming     |                                                                                  
+-------+-----------+
|null   |Default1   |
|null   |Default2   |
|null   |Default3   |
|null   |Default4   |
|null   |Default5   |
+-------+-----------+

Expected result:

+----------------------------+-----------+
|data                        |naming     |                                                                                  
+----------------------------+-----------+
|[[1,0.76],[2,0.55],[3,0.16]]|Default1   |
|[[1,0.2],[2,0.41],[3,0.89]] |Default2   |
|[[1,0.96],[2,0.12],[3,0.4]] |Default3   |
|[[1,0.28],[2,0.15],[3,0.31]]|Default4   |
|[[1,0.84],[2,0.41],[3,0.15]]|Default5   |
+----------------------------+-----------+

1 Answer

You have to transform the data and construct an expression like array(struct(<your columns>)):

scala> df.show(false)
+--------+----------+--------+----------+--------+----------+--------+
|data1_id|data1_stat|data2_id|data2_stat|data3_id|data3_stat|naming  |
+--------+----------+--------+----------+--------+----------+--------+
|1       |0.76      |2       |0.55      |3       |0.16      |Default1|
|1       |0.2       |2       |0.41      |3       |0.89      |Default2|
|1       |0.96      |2       |0.12      |3       |0.4       |Default3|
|1       |0.28      |2       |0.15      |3       |0.31      |Default4|
|1       |0.84      |2       |0.41      |3       |0.15      |Default5|
+--------+----------+--------+----------+--------+----------+--------+

Extract the required columns for the array:

scala> val arrayColumns = df
            .columns
            .filter(_.contains("data"))
            .map(_.split("_")(0))
            .distinct
            .map(c => struct(col(s"${c}_id").as("id"),col(s"${c}_stat").as("stat")))

scala> val colExpr = array(arrayColumns:_*).as("data")
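
As a side note, the prefix-extraction step above is plain Scala collection code and can be sanity-checked without Spark; a minimal sketch using the header names from the question:

```scala
// Header columns from the question's CSV
val cols = Array("data1_id", "data1_stat", "data2_id", "data2_stat",
                 "data3_id", "data3_stat", "naming")

// Keep only the data* columns, strip the _id/_stat suffix, deduplicate
val prefixes = cols
  .filter(_.contains("data"))
  .map(_.split("_")(0))
  .distinct

println(prefixes.mkString(","))  // data1,data2,data3
```

Each prefix then becomes one struct(col(s"${c}_id"), col(s"${c}_stat")) element of the array.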

Apply the colExpr expression to the DataFrame:

scala> val finalDf = df.select(colExpr,$"naming")

Schema

scala> finalDf.printSchema
root
 |-- data: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- id: string (nullable = true)
 |    |    |-- stat: string (nullable = true)
 |-- naming: string (nullable = true)
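
Note that id and stat come out as string here, because the quoted CSV columns were read as strings. If you want the integer/float types from the question's schema, a cast can be added inside the struct; a sketch in the same spark-shell session:

```scala
scala> val typedColumns = df
            .columns
            .filter(_.contains("data"))
            .map(_.split("_")(0))
            .distinct
            .map(c => struct(
              col(s"${c}_id").cast("int").as("id"),      // string -> integer
              col(s"${c}_stat").cast("float").as("stat") // string -> float
            ))

scala> val typedDf = df.select(array(typedColumns:_*).as("data"), $"naming")
```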

Result

scala> finalDf.show(false)
+------------------------------+--------+
|data                          |naming  |
+------------------------------+--------+
|[[1,0.76], [2,0.55], [3,0.16]]|Default1|
|[[1,0.2], [2,0.41], [3,0.89]] |Default2|
|[[1,0.96], [2,0.12], [3,0.4]] |Default3|
|[[1,0.28], [2,0.15], [3,0.31]]|Default4|
|[[1,0.84], [2,0.41], [3,0.15]]|Default5|
+------------------------------+--------+