
Here is some example data and a schema:

from pyspark.sql.types import StructType, StructField, StringType, ArrayType

mySchema = StructType([
    StructField('firstname', StringType()),
    StructField('lastname', StringType()),
    StructField('langages', ArrayType(StructType([
        StructField('lang1', StringType()),
        StructField('lang2', StringType())
    ])))
])

myData = [("john", "smith", [
            {'lang1': 'Java', 'lang2': 'Python'},
            {'lang1': 'C', 'lang2': 'R'},
            {'lang1': 'Perl', 'lang2': 'Scala'}
            ]),
          ("robert", "plant", [
            {'lang1': 'C', 'lang2': 'Java'},
            {'lang1': 'Python', 'lang2': 'Perl'}
            ])
          ]

Then create the DataFrame:

df = spark.createDataFrame(data=myData, schema=mySchema)

The schema looks like this:

df.printSchema()
root
|-- firstname: string (nullable = true)
|-- lastname: string (nullable = true)
|-- langages: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- lang1: string (nullable = true)
|    |    |-- lang2: string (nullable = true)

and when calling show():

df.show(df.count(), False)
+---------+--------+---------------------------------------+
|firstname|lastname|langages                               |
+---------+--------+---------------------------------------+
|john     |smith   |[[Java, Python], [C, R], [Perl, Scala]]|
|robert   |plant   |[[C, Java], [Python, Perl]]            |
+---------+--------+---------------------------------------+

At this point, all is all right. But now I want to "flatten" the arrays: create one column per element of "langages", joining each pair of languages with "/", in order to export the result as CSV. It might look like this:

firstname   lastname    langage_1    langage_2    langage_3 
john        smith       Java/Python  C/R          Perl/Scala
robert      plant       C/Java       Python/Perl

I tried to create 3 columns like this:

from pyspark.sql.functions import col

df.select([(col("langages")[x]).alias("langage_" + str(x + 1)) for x in range(0, 3)]).show()
+--------------+--------------+-------------+
|     langage_1|     langage_2|    langage_3|
+--------------+--------------+-------------+
|[Java, Python]|        [C, R]|[Perl, Scala]|
|     [C, Java]|[Python, Perl]|         null|
+--------------+--------------+-------------+

My problem is that the langages array may sometimes have 2, 3, 4, or any number of elements, so range(0, 3) might need to be range(0, 4)! I need to find the maximum number of elements across the arrays.

I also don't know how to concatenate the elements of each sub-array, e.g. turning [Java, Python] into "Java/Python".

Thanks for your help

1 Answer

First we compute the maximum number of columns needed, then create the columns from that value. Note that each generated column needs a distinct alias ("langage_1", "langage_2", ...), otherwise every column ends up named "langages":

import pyspark.sql.functions as F

n = df.select(F.max(F.size("langages")).alias("n_columns")).first().n_columns

df.select(
    "firstname",
    "lastname",
    *[F.concat_ws("/", F.array(
        F.col("langages").getItem(i).getItem("lang1"),
        F.col("langages").getItem(i).getItem("lang2"),
    )).alias("langage_" + str(i + 1)) for i in range(n)]
).show(truncate=False)

2 Comments

Thanks, but that's not really what I want... you build only one column with all langages, but I want as many columns as there are sub-arrays in the array: for the first row, a first column with "Java/Python", a second column with "C/R", and a third column with "Perl/Scala"; for the second row, a column with "C/Java", a second column with "Python/Perl", and an empty third column, because there are only 2 entries for this row.
Wow, that's great, it works like a charm! Now I'll look at this in detail in order to understand the code deeply. Thanks!
