Extracting column names from key:value strings stored inside an array column:
- build a proper JSON string (with quotes around the JSON keys and values)
- infer a schema from this JSON column
- parse the JSON into a struct and expand the struct into top-level columns
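The first step can be illustrated in plain Python, without Spark. This is only a sketch of what the expression to_json(str_to_map(array_join(array_col, ','))) does to one row; to_json_row is a hypothetical helper for illustration, not a Spark API:

```python
import json

def to_json_row(array_col):
    # Mimics to_json(str_to_map(array_join(array_col, ','))):
    # split each 'key:value' item into a pair, then emit a JSON object
    # in which both keys and values are quoted strings.
    pairs = dict(item.split(":", 1) for item in array_col)
    return json.dumps(pairs, separators=(",", ":"))

print(to_json_row(['a:123', 'b:125', 'c:456']))
# → {"a":"123","b":"125","c":"456"}
```

The quoting matters: spark.read.json can only infer a schema from well-formed JSON, and the raw 'a:123' items are not valid JSON on their own.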
Input example:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(12, ['a:123', 'b:125', 'c:456']),
     (13, ['a:443', 'b:225', 'c:126'])],
    ['id', 'array_col'])
df.show(truncate=0)
# +---+---------------------+
# |id |array_col |
# +---+---------------------+
# |12 |[a:123, b:125, c:456]|
# |13 |[a:443, b:225, c:126]|
# +---+---------------------+
Script:
# 1. join the items, split them into a map, serialize the map as a JSON string:
#    ['a:123', 'b:125', 'c:456'] → '{"a":"123","b":"125","c":"456"}'
df = df.withColumn("array_col", F.expr("to_json(str_to_map(array_join(array_col, ',')))"))
# 2. infer the schema from the JSON strings (this triggers an extra Spark job)
json_schema = spark.read.json(df.rdd.map(lambda row: row.array_col)).schema
# 3. parse the JSON into a struct, then expand the struct into top-level columns
df = df.withColumn("array_col", F.from_json("array_col", json_schema))
df = df.select("*", "array_col.*").drop("array_col")
df.show()
# +---+---+---+---+
# | id| a| b| c|
# +---+---+---+---+
# | 12|123|125|456|
# | 13|443|225|126|
# +---+---+---+---+