
I have created the DataFrame below from an RDD using reduceByKey. I want to split the first column (originally the key) into two new columns, split on the comma.

scala> result_winr_table.schema
res10: org.apache.spark.sql.types.StructType = StructType(StructField(_1,StructType(StructField(_1,IntegerType,false), StructField(_2,IntegerType,false)),true), StructField(_2,DoubleType,false))

scala> result_winr_table
res5: org.apache.spark.sql.DataFrame = [_1: struct<_1:int,_2:int>, _2: double]

scala> result_winr_table.show
+--------+-------------------+
|      _1|                 _2|
+--------+-------------------+
| [31,88]|              0.475|
| [18,91]| 0.5833333333333334|
| [56,95]|0.37142857142857144|
| [70,61]| 0.6266666666666667|
|[104,11]| 0.4527911784975879|
| [42,58]| 0.6857142857142857|
| [13,82]| 0.3333333333333333|
| [30,18]|0.49310344827586206|
| [99,18]|0.44285714285714284|
| [53,31]| 0.2981366459627329|
| [52,84]| 0.4444444444444444|
| [60,38]|               0.38|
|  [79,9]|0.36666666666666664|
| [20,85]| 0.4389312977099237|
| [61,87]| 0.4807692307692308|
|  [3,67]| 0.4245810055865922|
| [62,84]|0.47796610169491527|
|  [9,32]| 0.4727272727272727|
| [94,44]| 0.5698324022346368|
| [50,67]|0.45083487940630795|
+--------+-------------------+

I tried using the split method on the column directly, but it didn't work because of a type mismatch.

What would be the best way to achieve this?
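Since the schema suggests the DataFrame came from an RDD of ((Int, Int), Double) pairs, one alternative worth noting is to flatten the key tuple before ever calling toDF. A minimal sketch, using a plain Scala list to stand in for the RDD (the identical `map` works on the real RDD; the column names passed to toDF are illustrative):

```scala
// Stand-in for the RDD[((Int, Int), Double)] produced by reduceByKey.
val pairs = List(((31, 88), 0.475), ((18, 91), 0.5833333333333334))

// Flatten the nested key into three top-level fields, so that
// toDF("first", "second", "winr") would yield three plain columns.
val flat = pairs.map { case ((a, b), v) => (a, b, v) }
// flat == List((31, 88, 0.475), (18, 91, 0.5833333333333334))
```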

3 Answers


Given that the schema is

root
 |-- _1: struct (nullable = true)
 |    |-- _1: integer (nullable = false)
 |    |-- _2: integer (nullable = false)
 |-- _2: double (nullable = false)

you can use the withColumn API as follows:

result_winr_table.withColumn("first", $"_1._1")
  .withColumn("second", $"_1._2")

and if you don't want the original column, you can follow with .drop("_1").
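The dotted path in `$"_1._1"` mirrors plain tuple access on the original key. As a quick Spark-free analogy (values taken from the first row of the table above):

```scala
// One row of result_winr_table viewed as its pre-DataFrame Scala type:
// the struct<_1:int,_2:int> column corresponds to the (Int, Int) key.
val row: ((Int, Int), Double) = ((31, 88), 0.475)

val first  = row._1._1 // what $"_1._1" selects -> 31
val second = row._1._2 // what $"_1._2" selects -> 88
```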


If you have a complex struct that you don't know the property names for at compile time you can do this:

case class Foo(a: Int, b: String, c: Boolean)
val df = Seq( (1, Foo(2, "three", false)), (2, Foo(4, "five", true)) ).toDF("id", "foo")

df.show
+---+-----------------+
| id|              foo|
+---+-----------------+
|  1|[2, three, false]|
|  2|  [4, five, true]|
+---+-----------------+

df.select($"*", $"foo.*").show
+---+-----------------+---+-----+-----+
| id|              foo|  a|    b|    c|
+---+-----------------+---+-----+-----+
|  1|[2, three, false]|  2|three|false|
|  2|  [4, five, true]|  4| five| true|
+---+-----------------+---+-----+-----+
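The `$"foo.*"` expansion works field by field over the struct. Conceptually it is close to how a case class exposes its fields positionally through `productIterator` (a plain-Scala sketch, no Spark needed):

```scala
case class Foo(a: Int, b: String, c: Boolean)

// productIterator walks the fields in declaration order, much as
// $"foo.*" expands the struct into one column per field.
val fields = Foo(2, "three", false).productIterator.toList
// fields == List(2, "three", false)
```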



For those looking to solve this with Spark SQL, as I was: the following query will flatten your DataFrame/table in Spark 1.6+. Note that the DataFrame must first be registered as a temporary table (e.g. with registerTempTable in 1.6, or createOrReplaceTempView in 2.x) so the name resolves:

sqlContext.sql(s""" select _1["_1"] as col1, _1["_2"] as col2, _2 as col3 from result_winr_table """)

