
I have a PySpark dataframe whose schema looks like this:

 root
 |-- company: struct (nullable = true)
 |    |-- 0: long (nullable = true)
 |    |-- 1: long (nullable = true)
 |    |-- 10: long (nullable = true)
 |    |-- 100: long (nullable = true)
 |    |-- 101: long (nullable = true)
 |    |-- 102: long (nullable = true)
 |    |-- 103: long (nullable = true)
 |    |-- 104: long (nullable = true)
 |    |-- 105: long (nullable = true)
 |    |-- 106: long (nullable = true)
 |    |-- 107: long (nullable = true)
 |    |-- 108: long (nullable = true)
 |    |-- 109: long (nullable = true)

I want the final format of this dataframe to look like this:

id    value
0     1001
1     1002
10    1004
100   1005
101   1007
102   1008

Please help me to solve this using Pyspark.

  • Can you provide the code to recreate the input dataframe, so we don't have to do that and can focus on the problem? See how to create a minimal, complete, verifiable example. Commented Jun 24, 2020 at 15:13
  • It's a JSON file and I created the dataframe directly from it. Commented Jun 25, 2020 at 13:21

2 Answers


Try this -

Written in Scala, but it should be achievable in PySpark with minimal changes.

Load the test data

    val df = spark.sql("select company from values (named_struct('0', 'foo', '1', 'bar')) T(company)")
    df.show(false)
    df.printSchema()
    /**
      * +----------+
      * |company   |
      * +----------+
      * |[foo, bar]|
      * +----------+
      *
      * root
      * |-- company: struct (nullable = false)
      * |    |-- 0: string (nullable = false)
      * |    |-- 1: string (nullable = false)
      */

Explode struct

    val structCols = df.schema("company").dataType.asInstanceOf[StructType].map(_.name)
    df.withColumn("company", map_from_arrays(
      array(structCols.map(lit): _*),
      array(structCols.map(c => col(s"company.$c")): _*)
    ))
      .selectExpr("explode(company) as (id, name)")
      .show(false)

    /**
      * +---+----+
      * |id |name|
      * +---+----+
      * |0  |foo |
      * |1  |bar |
      * +---+----+
      */

2 Comments

But in the question you mentioned it's of string type. Check your schema in the question.
Updated the question; some of the functions are not available in PySpark.

In Python you can unpivot it using stack:

    import pyspark.sql.functions as f
    from functools import reduce

    df1 = df.select('company.*')
    cols = ','.join([f"'{c}',`{c}`" for c in df1.columns])

    # cast every column to string so stack() sees one uniform type
    df1 = reduce(lambda df, c: df.withColumn(c, f.col(c).cast('string')), df1.columns, df1)

    df1.select(f.expr(f'''stack({len(df1.columns)},{cols}) as (id, name)''')).show()

    +---+----+
    | id|name|
    +---+----+
    |  0| foo|
    |  1| bar|
    +---+----+

1 Comment

The company.* struct columns are of LongType; I'm not able to convert all the columns to StringType in PySpark.
