
I have a data frame with a column containing a JSON string. Example below. There are 3 columns - a, b, c. Column c is StringType.

| a         | b    |           c                       |
--------------------------------------------------------
|77         |ABC   |    {"12549":38,"333513":39}       |
|78         |ABC   |    {"12540":38,"333513":39}       |

I want to pivot the JSON keys into columns of the data frame, as in the example below -

| a         | b    | 12549  | 333513 | 12540 |
-----------------------------------------------
|77         |ABC   |38      |39      | null  |
|78         |ABC   | null   |39      | 38    |
  • Does the json always have the same format? Commented Mar 25, 2019 at 12:51
  • I think you need a cleaner explanation. Commented Mar 25, 2019 at 13:15
  • @Oli The JSON keys are not fixed, but the JSON is always in the same format. Commented Mar 25, 2019 at 13:41
  • So what would be the logic in that case? Could you provide an example that fully reflects what you are trying to do? (and possibly a more extensive explanation) Commented Mar 25, 2019 at 13:42
  • updated question. Commented Mar 25, 2019 at 13:43

1 Answer


This may not be the most efficient approach, as it has to read all of the JSON records an extra time to infer the schema. If you can define the schema statically, it should do better.

import org.apache.spark.sql.functions.from_json
import spark.implicits._

val data = spark.createDataset(Seq(
  (77, "ABC", "{\"12549\":38,\"333513\":39}"),
  (78, "ABC", "{\"12540\":38,\"333513\":39}")
)).toDF("a", "b", "c")

// Infer the schema by reading the JSON strings once
val schema = spark.read.json(data.select("c").as[String]).schema

// Parse column c with the inferred schema, then flatten the struct into top-level columns
data.select($"a", $"b", from_json($"c", schema).as("s")).select("a", "b", "s.*").show(false)

Result:

+---+---+-----+-----+------+
|a  |b  |12540|12549|333513|
+---+---+-----+-----+------+
|77 |ABC|null |38   |39    |
|78 |ABC|38   |null |39    |
+---+---+-----+-----+------+
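If the JSON keys are known up front, the extra inference pass can be skipped by building the schema by hand. A minimal sketch, assuming the keys and integer types from the example data (the field names and types here are assumptions based on the sample rows):

```scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{StructType, StructField, LongType}

// Hypothetical static schema: only viable when the JSON keys are fixed and known
val staticSchema = StructType(Seq(
  StructField("12549", LongType),
  StructField("12540", LongType),
  StructField("333513", LongType)
))

// Same select as above, but without the schema-inference read
data.select($"a", $"b", from_json($"c", staticSchema).as("s"))
  .select("a", "b", "s.*")
  .show(false)
```

Keys absent from a given row simply come back as null, matching the inferred-schema result.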

4 Comments

Great, but I am also getting a _corrupt_record column.
Some of your json data is either corrupt or contains newlines. You can try with a multiline option val schema = spark.read.option("multiline", true).json(data.select("c").as[String]).schema or you'll have to filter or correct the corrupt data: source
How to make the header be something like "a, b, json_12540, json_12549, json_333513"?
To do this dynamically, I think you'd have to iterate df.columns and selectively rename them individually with .withColumnRenamed(). I will reiterate, however, that a static schema would be much more performant in this case.
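That iteration could be sketched as follows, assuming `parsed` holds the flattened data frame from the answer and that the json_ prefix from the comment is wanted on every column except a and b (the names `parsed` and `renamed` are illustrative):

```scala
// Start from the flattened result of the answer's select
val parsed = data.select($"a", $"b", from_json($"c", schema).as("s"))
  .select("a", "b", "s.*")

// Prefix every pivoted column with "json_", leaving a and b untouched
val renamed = parsed.columns.foldLeft(parsed) { (df, name) =>
  if (name == "a" || name == "b") df
  else df.withColumnRenamed(name, s"json_$name")
}

renamed.show(false)
```

Each withColumnRenamed call only adjusts metadata in the plan, so the loop is cheap; the schema-inference read remains the expensive part.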
