
I have some JSON data like the example below, and I need to create new columns based on some of the JSON values.

{ "start": "1234567679", "test": ["abc"], "value": 324, "end": "1234567689" }

{ "start": "1234567679", "test": ["xyz"], "value": "Near", "end": "1234567689"}

{ "start": "1234568679", "test": ["pqr"], "value": ["Attr"," "], "end":"1234568679"}  

{ "start": "1234568997", "test": ["mno"], "value": ["{\"key\": \"1\", \"value\": [\"789\"]}" ], "end": "1234568999"} 

Above are the example JSON records.

I want to create columns like below:

 start      abc   xyz   pqr   mno   end
 1234567679 324   null  null  null  1234567689
 1234567679 null  Near  null  null  1234567689
 1234568679 null  null  Attr  null  1234568679
 1234568997 null  null  null  789   1234568999
def getValue1(s1: Seq[String], v: String): String =
  if (s1(0) == "abc") v else null

def getValue2(s1: Seq[String], v: String): String =
  if (s1(0) == "xyz") v else null

// withColumn needs Column expressions, so the functions are wrapped as udfs
val getValue1Udf = udf(getValue1 _)
val getValue2Udf = udf(getValue2 _)

val df = spark.read.json("path to json")

val tdf = df.withColumn("abc", getValue1Udf($"test", $"value"))
            .withColumn("xyz", getValue2Udf($"test", $"value"))

But I don't want to use this approach because I have many test values. I want some function that does something like this:

def getColumnname(s1: Seq[String]) = {
    return s1(0)
}  

val tdf = df.withColumn(getColumnname($"test"), $"value")

Is it a good idea to change the values to columns? I want this because I need to apply the result to some machine-learning code which needs plain columns.
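To make the target shape concrete, the desired long-to-wide reshape can be sketched in plain Scala (no Spark), using a hypothetical `Record` case class and the first two sample rows; this is only an illustration of the grouping logic, not the Spark solution:

```scala
// Hypothetical plain-Scala sketch of the long-to-wide reshape: each record's
// first `test` value becomes a column name holding that record's `value`.
case class Record(start: String, test: Seq[String], value: String, end: String)

val records = Seq(
  Record("1234567679", Seq("abc"), "324", "1234567689"),
  Record("1234567679", Seq("xyz"), "Near", "1234567689")
)

val columns = Seq("abc", "xyz")

// Group by (start, end) and spread each test name into its own column;
// missing test names get null, matching the desired output table.
val wide: Seq[Map[String, String]] = records
  .groupBy(r => (r.start, r.end))
  .map { case ((s, e), rs) =>
    val byTest = rs.map(r => r.test.head -> r.value).toMap
    Map("start" -> s, "end" -> e) ++
      columns.map(c => c -> byTest.getOrElse(c, null))
  }
  .toSeq
```

Since both sample records share the same (start, end) pair, they collapse into a single wide row with both `abc` and `xyz` populated.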

1 Answer

You can use a pivot operation to do such things. Assuming you always have exactly one item in the array in your test column, here is a simple solution:

import org.apache.spark.sql.functions._
val df = sqlContext.read.json("<yourPath>")
df.withColumn("test", $"test".getItem(0)).groupBy($"start", $"end").pivot("test").agg(first("value")).show
+----------+----------+----+----+
|     start|       end| abc| xyz|
+----------+----------+----+----+
|1234567679|1234567689| 324|null|
|1234567889|1234567689|null| 789|
+----------+----------+----+----+

If you have multiple values in the test column, you can also use the explode function:

df.withColumn("test", explode($"test")).groupBy($"start", $"end").pivot("test").agg(first("value")).show
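What explode does to the array column can be sketched in plain Scala (hypothetical `Row` class, not the Spark API): one input row with n array elements becomes n output rows, each carrying the other columns unchanged.

```scala
// Plain-Scala sketch of explode semantics: a flatMap over the array column.
case class Row(start: String, test: Seq[String], value: String)

val rows = Seq(Row("1234567679", Seq("abc", "xyz"), "324"))

// Each element of `test` yields its own (start, test, value) row.
val exploded: Seq[(String, String, String)] =
  rows.flatMap(r => r.test.map(t => (r.start, t, r.value)))
```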


I hope it helps!

Update I

Based on your comments and the updated question, here is the solution you need. I have intentionally separated all the operations so you can easily see what you would need to change for further improvements:

df.withColumn("value", regexp_replace($"value", "\\[", "")). //1
   withColumn("value", regexp_replace($"value", "\\]", "")). //2
   withColumn("value", split($"value", "\\,")).              //3
   withColumn("test", explode($"test")).                     //4
   withColumn("value", explode($"value")).                   //5
   withColumn("value", regexp_replace($"value", " +", "")).  //6
   filter($"value" =!= "").                                  //7
   groupBy($"start", $"end").pivot("test").                  //8
   agg(first("value")).show                                  //9
  • When you read such JSON files, you get a data frame whose value column has StringType. You cannot directly cast StringType to ArrayType, so you need some tricks like those in lines 1, 2 and 3 to convert it into ArrayType. You could do these operations in one line, with just one regular expression, or by defining a udf. It is all up to you; I'm just trying to show the abilities of Apache Spark.

  • Now you have a value column with ArrayType. Explode this column in line 5 as we did in line 4 for the test column, then apply the pivot operation.
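The string cleanup in lines 1–3 and 6–7 can also be expressed as one plain-Scala function (the same logic a single udf could wrap, as mentioned above); this is a sketch of the equivalent transformation, not a drop-in replacement for the column expressions:

```scala
// Plain-Scala equivalent of the cleanup chain: strip brackets (steps 1-2),
// split on commas (step 3), remove spaces (step 6), drop empty entries (step 7).
def cleanValue(raw: String): Seq[String] =
  raw
    .replaceAll("\\[", "")
    .replaceAll("\\]", "")
    .split(",")
    .map(_.replaceAll(" +", ""))
    .filter(_.nonEmpty)
    .toSeq
```

For example, the array-like string `[Attr, ]` reduces to just `Attr`, while a scalar like `324` passes through unchanged.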


2 Comments

Hi, there is one more strange thing observed in my data: the value field is not always an int, it is sometimes a string, an array, etc. { "start": "1234567679", "test": ["abc"], "value": 324, "end": "1234567689" } { "start": "1234567679", "test": ["xyz"], "value": "Near", "end": "1234567689"} { "start": "1234567679", "test": ["pqr"], "value": ["List"], "end": "1234567689"}
Hi, I found one more type in the value field, which is embedded JSON; I want to extract only the value field from that JSON.
