
I need to expand the JSON object (column B) into multiple columns.

From this table,

Column A    Column B
id1         [{a:1, b:'letter1'}]
id2         [{a:1, b:'letter2', c:3, d:4}]

To this table,

Column A    a    b    c    d
id1         1    2
id2         1    2    3    4

I have tried transforming the dataframe both locally (pandas) and in Spark, but neither worked.

Locally, I extracted the key/value pairs in column B with nested loops (this step succeeded).

But when I tried to turn the extracted key/value pairs (a dictionary structure) into a DataFrame, this error occurred: "ValueError: All arrays must be of the same length".

This is because keys c and d are missing from some of the JSON objects, so it failed.

I referred to the answer below in this case:

Expand Dataframe containing JSON object into larger dataframe
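For the local route, one workaround (a sketch using hypothetical sample strings shaped like the table above) is to parse each JSON string first and let pd.json_normalize pad the missing keys with NaN, which sidesteps the "All arrays must be of the same length" error:

```python
import json
import pandas as pd

# Hypothetical sample shaped like the question's table
raw = pd.DataFrame({
    "A": ["id1", "id2"],
    "B": ['[{"a": 1, "b": "letter1"}]',
          '[{"a": 1, "b": "letter2", "c": 3, "d": 4}]'],
})

# Parse each JSON string and take the single object inside the list;
# json_normalize fills keys missing from a row with NaN
records = [json.loads(s)[0] for s in raw["B"]]
expanded = pd.concat([raw[["A"]], pd.json_normalize(records)], axis=1)
print(expanded)
```

If a cell's list can contain more than one object, you would need to explode column B into one object per row before normalizing.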


In Spark, I got a type error (something like LongType and StringType can't be recognized) when converting the pandas DataFrame to a Spark DataFrame.

So I converted the pandas DataFrame to string type with df.astype(str), after which I could convert it to a Spark DataFrame.

def func(df):
    spark = (
        SparkSession.builder.appName("data")
        .enableHiveSupport()
        .getOrCreate()
    )
    df1 = spark.createDataFrame(df)

Now, when I tried to expand it ...

for i in df.columns:
    if i == 'a_column':
        # Since the rows became strings instead of lists,
        # I need to remove the first and last characters, which are the [ and ].
        # But I get an error here: failed due to error Column is not iterable
        df.withColumn(i, substring(i, 2, length(i)))
        df.withColumn(i, substring(i, 1, length(i) - 1))

        # Transform each row (JSON string) to a JSON object.
        # But I get errors here: ValueError: 'json' is not in list; AttributeError: json
        # I assume x.json means converting the row to a JSON object?
        df = df.map(lambda x: x.json)
        print(df.take(10))

I referred to the answers below in this case. I can't hardcode the schemas, as there are lots of different JSON columns.

Pyspark: explode json in column to multiple columns

Pyspark: Parse a column of json strings

Could someone please show me how to do it, both locally and in Spark?

Any help is appreciated.

1 Answer


To simplify the data processing and transformation, I recommend implementing it in the following manner:

  • Convert the local pandas DataFrame to a Spark DataFrame without parsing the JSON column.
  • Parse the JSON column in the Spark environment.

You can implement the same functionality using only the built-in SQL functions in Spark as shown below.

import json
import pandas as pd
from pyspark.sql.functions import col, expr, map_entries, from_json, get
from pyspark.sql.types import ArrayType, MapType, StringType

# Sample data
data = [
    ("id1", json.dumps([{"a": 1, "b": 2}])),
    ("id2", json.dumps([{"a": 1, "b": 2, "c": 3, "d": 4}])),
]

# Make Pandas Dataframe
pandas_df = pd.DataFrame(data, columns=["id", "B"])

# Make Spark DataFrame with string type json column.
# Main Part 1
json_schema = ArrayType(MapType(StringType(), StringType()))
df = spark.createDataFrame(pandas_df)
df = df.withColumn("B", get(from_json(col("B"), json_schema), 0))

# Main Part 2
df = df.withColumn("entries", map_entries(col("B")))

# Main Part 3
keys = df.selectExpr("explode(entries.key)").distinct().rdd.flatMap(lambda x: x).collect()

# Main Part 4
for key in keys:
    df = df.withColumn(key, expr(f"transform(entries, x -> if(x.key = '{key}', x.value, NULL))"))
    df = df.withColumn(key, expr(f"get(filter({key}, x -> x is not NULL), 0)"))

df = df.drop(col("entries"))

df.show()

The above code consists of four main parts:

  1. Parse the string column into proper JSON (an array of maps) and take its first element.
  2. Convert the map to an array of key/value entries that can be iterated over.
  3. Collect every distinct key from the JSON to create the new columns.
  4. Use transform and filter to extract the data matching each created column.
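In plain Python terms, parts 3 and 4 boil down to taking the union of keys across rows and padding each row's map with NULL for absent keys. A rough sketch of that logic (not the Spark execution itself):

```python
# Values are strings because the Spark code parses the JSON as
# MapType(StringType(), StringType())
rows = [{"a": "1", "b": "2"},
        {"a": "1", "b": "2", "c": "3", "d": "4"}]

# Part 3: the union of all keys becomes the new column set
keys = sorted({k for row in rows for k in row})

# Part 4: each row keeps its own value for a key, or None if absent
table = [{k: row.get(k) for k in keys} for row in rows]
print(table)
```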

However, this method might not be the best option. Writing a UDF in Scala could result in simpler code that runs more efficiently.

If a type error issue still occurs in Spark using this approach, please leave a comment.

I hope my answer has been helpful.

Thank you.
