
I have a data frame with a column containing a JSON string. Example below. There are 3 columns - a, b, c. Column c is StringType.

| a         | b    |           c                       |
--------------------------------------------------------
|77         |ABC   |    {"12549":38,"333513":39}       |
|78         |ABC   |    {"12540":38,"333513":39}       |

I want to pivot the JSON keys into columns of the data frame, as in the example below -

| a         | b    | 12549  | 333513 | 12540 |
-----------------------------------------------
|77         |ABC   |38      |39      | null  |
|78         |ABC   | null   |39      | 38    |
  • Does the json always have the same format? Commented Mar 25, 2019 at 12:51
  • I think you need a cleaner explanation. Commented Mar 25, 2019 at 13:15
  • @Oli The JSON keys are not fixed, but the JSON is always in the same format. Commented Mar 25, 2019 at 13:41
  • So what would be the logic in that case? Could you provide an example that fully reflects what you are trying to do? (and possibly a more extensive explanation) Commented Mar 25, 2019 at 13:42
  • updated question. Commented Mar 25, 2019 at 13:43

1 Answer


This may not be the most efficient approach, as it has to read all of the JSON records an extra time to infer the schema. If you can define the schema statically, it should do better.

import org.apache.spark.sql.functions.from_json
import spark.implicits._

val data = spark.createDataset(Seq(
  (77, "ABC", "{\"12549\":38,\"333513\":39}"),
  (78, "ABC", "{\"12540\":38,\"333513\":39}")
)).toDF("a", "b", "c")

// Infer the schema by reading the JSON strings once
val schema = spark.read.json(data.select("c").as[String]).schema

// Parse column c with the inferred schema, then flatten the struct into top-level columns
data.select($"a", $"b", from_json($"c", schema).as("s")).select("a", "b", "s.*").show(false)

Result:

+---+---+-----+-----+------+
|a  |b  |12540|12549|333513|
+---+---+-----+-----+------+
|77 |ABC|null |38   |39    |
|78 |ABC|38   |null |39    |
+---+---+-----+-----+------+
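If the JSON keys are known up front, the extra inference pass can be skipped by building the schema by hand. A minimal sketch, assuming the keys and integer types from the example data (the field names and types here are assumptions based on the sample rows):

```scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{StructType, StructField, LongType}

// Hypothetical static schema: only viable when the JSON keys are fixed and known
val staticSchema = StructType(Seq(
  StructField("12549", LongType),
  StructField("12540", LongType),
  StructField("333513", LongType)
))

// Same select as above, but without the schema-inference read
data.select($"a", $"b", from_json($"c", staticSchema).as("s"))
  .select("a", "b", "s.*")
  .show(false)
```

Keys absent from a given row simply come back as null, matching the inferred-schema result.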

4 Comments

Great, but I am also getting a _corrupt_record column.
Some of your json data is either corrupt or contains newlines. You can try with a multiline option val schema = spark.read.option("multiline", true).json(data.select("c").as[String]).schema or you'll have to filter or correct the corrupt data: source
How to make the header be something like "a, b, json_12540, json_12549, json_333513"?
To do this dynamically, I think you'd have to iterate df.columns and selectively rename them individually with .withColumnRenamed(). I will reiterate, however, that a static schema would be much more performant in this case.
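That iteration could be sketched as follows, assuming `parsed` holds the flattened data frame from the answer and that the json_ prefix from the comment is wanted on every column except a and b (the names `parsed` and `renamed` are illustrative):

```scala
// Start from the flattened result of the answer's select
val parsed = data.select($"a", $"b", from_json($"c", schema).as("s"))
  .select("a", "b", "s.*")

// Prefix every pivoted column with "json_", leaving a and b untouched
val renamed = parsed.columns.foldLeft(parsed) { (df, name) =>
  if (name == "a" || name == "b") df
  else df.withColumnRenamed(name, s"json_$name")
}

renamed.show(false)
```

Each withColumnRenamed call only adjusts metadata in the plan, so the loop is cheap; the schema-inference read remains the expensive part.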
