
I have a PySpark dataframe whose schema looks like this:

 root
 |-- company: struct (nullable = true)
 |    |-- 0: long (nullable = true)
 |    |-- 1: long (nullable = true)
 |    |-- 10: long (nullable = true)
 |    |-- 100: long (nullable = true)
 |    |-- 101: long (nullable = true)
 |    |-- 102: long (nullable = true)
 |    |-- 103: long (nullable = true)
 |    |-- 104: long (nullable = true)
 |    |-- 105: long (nullable = true)
 |    |-- 106: long (nullable = true)
 |    |-- 107: long (nullable = true)
 |    |-- 108: long (nullable = true)
 |    |-- 109: long (nullable = true)

I want the final format of this dataframe to look like this:

id    value
0     1001
1     1002
10    1004
100   1005
101   1007
102   1008

Please help me to solve this using Pyspark.

  • Can you provide the code to recreate the input dataframe, so we don't have to do that and can focus on the problem? See how to create a minimal, complete, verifiable example. Commented Jun 24, 2020 at 15:13
  • It's a JSON file and I created the dataframe directly from it. Commented Jun 25, 2020 at 13:21

2 Answers


Try this -

Written in Scala, but it should be achievable in PySpark with minimal changes.

Load the test data

    val df = spark.sql("select company from values (named_struct('0', 'foo', '1', 'bar')) T(company)")
    df.show(false)
    df.printSchema()
    /**
      * +----------+
      * |company   |
      * +----------+
      * |[foo, bar]|
      * +----------+
      *
      * root
      * |-- company: struct (nullable = false)
      * |    |-- 0: string (nullable = false)
      * |    |-- 1: string (nullable = false)
      */

Explode struct

    val structCols = df.schema("company").dataType.asInstanceOf[StructType].map(_.name)
    df.withColumn("company", map_from_arrays(
      array(structCols.map(lit): _*),
      array(structCols.map(c => col(s"company.$c")): _*)
    ))
      .selectExpr("explode(company) as (id, name)")
      .show(false)

    /**
      * +---+----+
      * |id |name|
      * +---+----+
      * |0  |foo |
      * |1  |bar |
      * +---+----+
      */

2 Comments

But in the question you mentioned it's of string type. Check your schema in the question.
Updated the question; some of the functions are not available in PySpark.

In Python you can unpivot it using stack:

    import pyspark.sql.functions as f
    from functools import reduce

    df1 = df.select('company.*')
    cols = ','.join([f"'{c}',`{c}`" for c in df1.columns])

    # cast every column to string so stack() sees one uniform type
    df1 = reduce(lambda df, c: df.withColumn(c, f.col(c).cast('string')), df1.columns, df1)

    df1.select(f.expr(f'''stack({len(df1.columns)},{cols}) as (id, name)''')).show()

    +---+----+
    | id|name|
    +---+----+
    |  0| foo|
    |  1| bar|
    +---+----+

1 Comment

The company.* struct columns are of LongType; I'm not able to convert all the columns to StringType in PySpark.
