
I have some JSON data like the example below, and I need to create new columns based on some of the JSON values.

{ "start": "1234567679", "test": ["abc"], "value": 324, "end": "1234567689" }

{ "start": "1234567679", "test": ["xyz"], "value": "Near", "end": "1234567689"}

{ "start": "1234568679", "test": ["pqr"], "value": ["Attr"," "], "end":"1234568679"}  

{ "start": "1234568997", "test": ["mno"], "value": ["{\"key\": \"1\", \"value\": [\"789\"]}" ], "end": "1234568999"} 

Above are the example JSON records.

I want to create columns like below:

 start      abc   xyz   pqr   mno   end
 1234567679 324   null  null  null  1234567689
 1234567679 null  Near  null  null  1234567689
 1234568679 null  null  Attr  null  1234568679
 1234568997 null  null  null  789   1234568999
def getValue1(s1: Seq[String], v: String): String =
  if (s1(0) == "abc") v else null

def getValue2(s1: Seq[String], v: String): String =
  if (s1(0) == "xyz") v else null

// withColumn needs Column expressions, so the functions are wrapped as udfs
val getValue1Udf = udf(getValue1 _)
val getValue2Udf = udf(getValue2 _)

val df = spark.read.json("path to json")

val tdf = df.withColumn("abc", getValue1Udf($"test", $"value"))
            .withColumn("xyz", getValue2Udf($"test", $"value"))

But I don't want to use this approach because I have many test values. I want some function that does something like this:

def getColumnname(s1: Seq[String]) = {
    return s1(0)
}  

val tdf = df.withColumn(getColumnname($"test"), $"value")

Is it a good idea to change the values to columns? I want this because I need to apply the result to some machine-learning code which needs plain columns.
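To make the target shape concrete, the desired long-to-wide reshape can be sketched in plain Scala (no Spark), using a hypothetical `Record` case class and the first two sample rows; this is only an illustration of the grouping logic, not the Spark solution:

```scala
// Hypothetical plain-Scala sketch of the long-to-wide reshape: each record's
// first `test` value becomes a column name holding that record's `value`.
case class Record(start: String, test: Seq[String], value: String, end: String)

val records = Seq(
  Record("1234567679", Seq("abc"), "324", "1234567689"),
  Record("1234567679", Seq("xyz"), "Near", "1234567689")
)

val columns = Seq("abc", "xyz")

// Group by (start, end) and spread each test name into its own column;
// missing test names get null, matching the desired output table.
val wide: Seq[Map[String, String]] = records
  .groupBy(r => (r.start, r.end))
  .map { case ((s, e), rs) =>
    val byTest = rs.map(r => r.test.head -> r.value).toMap
    Map("start" -> s, "end" -> e) ++
      columns.map(c => c -> byTest.getOrElse(c, null))
  }
  .toSeq
```

Since both sample records share the same (start, end) pair, they collapse into a single wide row with both `abc` and `xyz` populated.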

1 Answer

You can use a pivot operation to do such things. Assuming you always have exactly one item in the array in your test column, here is a simple solution:

import org.apache.spark.sql.functions._
val df = sqlContext.read.json("<yourPath>")
df.withColumn("test", $"test".getItem(0)).groupBy($"start", $"end").pivot("test").agg(first("value")).show
+----------+----------+----+----+
|     start|       end| abc| xyz|
+----------+----------+----+----+
|1234567679|1234567689| 324|null|
|1234567889|1234567689|null| 789|
+----------+----------+----+----+

If you have multiple values in the test column, you can also use the explode function:

df.withColumn("test", explode($"test")).groupBy($"start", $"end").pivot("test").agg(first("value")).show
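What explode does to the array column can be sketched in plain Scala (hypothetical `Row` class, not the Spark API): one input row with n array elements becomes n output rows, each carrying the other columns unchanged.

```scala
// Plain-Scala sketch of explode semantics: a flatMap over the array column.
case class Row(start: String, test: Seq[String], value: String)

val rows = Seq(Row("1234567679", Seq("abc", "xyz"), "324"))

// Each element of `test` yields its own (start, test, value) row.
val exploded: Seq[(String, String, String)] =
  rows.flatMap(r => r.test.map(t => (r.start, t, r.value)))
```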


I hope it helps!

Update I

Based on your comments and the updated question, here is the solution you need. I have intentionally separated all the operations so you can easily see what you would need to change for further improvements:

df.withColumn("value", regexp_replace($"value", "\\[", "")). //1
   withColumn("value", regexp_replace($"value", "\\]", "")). //2
   withColumn("value", split($"value", "\\,")).              //3
   withColumn("test", explode($"test")).                     //4
   withColumn("value", explode($"value")).                   //5
   withColumn("value", regexp_replace($"value", " +", "")).  //6
   filter($"value" =!= "").                                  //7
   groupBy($"start", $"end").pivot("test").                  //8
   agg(first("value")).show                                  //9
  • When you read such JSON files, you get a data frame whose value column has StringType. You cannot directly cast StringType to ArrayType, so you need some tricks like those in lines 1, 2 and 3 to convert it into ArrayType. You could do these operations in one line, with just one regular expression, or by defining a udf. It is all up to you; I'm just trying to show the abilities of Apache Spark.

  • Now you have a value column with ArrayType. Explode this column in line 5 as we did in line 4 for the test column, then apply the pivot operation.
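The string cleanup in lines 1–3 and 6–7 can also be expressed as one plain-Scala function (the same logic a single udf could wrap, as mentioned above); this is a sketch of the equivalent transformation, not a drop-in replacement for the column expressions:

```scala
// Plain-Scala equivalent of the cleanup chain: strip brackets (steps 1-2),
// split on commas (step 3), remove spaces (step 6), drop empty entries (step 7).
def cleanValue(raw: String): Seq[String] =
  raw
    .replaceAll("\\[", "")
    .replaceAll("\\]", "")
    .split(",")
    .map(_.replaceAll(" +", ""))
    .filter(_.nonEmpty)
    .toSeq
```

For example, the array-like string `[Attr, ]` reduces to just `Attr`, while a scalar like `324` passes through unchanged.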


2 Comments

Hi, there is one more strange thing observed in my data: the value field is not always an int, it is sometimes a string, an array, etc. { "start": "1234567679", "test": ["abc"], "value": 324, "end": "1234567689" } { "start": "1234567679", "test": ["xyz"], "value": "Near", "end": "1234567689"} { "start": "1234567679", "test": ["pqr"], "value": ["List"], "end": "1234567689"}
Hi, I found one more type in the value field, which is embedded JSON; I want to extract only the value field from that JSON.
