
Here is some example data and a schema:

from pyspark.sql.types import StructType, StructField, StringType, ArrayType

mySchema = StructType([
    StructField('firstname', StringType()),
    StructField('lastname', StringType()),
    StructField('langages', ArrayType(StructType([
        StructField('lang1', StringType()),
        StructField('lang2', StringType())
    ])))
])

myData = [("john", "smith", [
            {'lang1': 'Java', 'lang2': 'Python'},
            {'lang1': 'C', 'lang2': 'R'},
            {'lang1': 'Perl', 'lang2': 'Scala'}
            ]),
          ("robert", "plant", [
            {'lang1': 'C', 'lang2': 'Java'},
            {'lang1': 'Python', 'lang2': 'Perl'}
            ])
          ]

Then create the DataFrame:

df = spark.createDataFrame(data=myData, schema=mySchema)

The schema looks like this:

df.printSchema()
root
|-- firstname: string (nullable = true)
|-- lastname: string (nullable = true)
|-- langages: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- lang1: string (nullable = true)
|    |    |-- lang2: string (nullable = true)

and when calling show():

df.show(df.count(), False)
+---------+--------+---------------------------------------+
|firstname|lastname|langages                               |
+---------+--------+---------------------------------------+
|john     |smith   |[[Java, Python], [C, R], [Perl, Scala]]|
|robert   |plant   |[[C, Java], [Python, Perl]]            |
+---------+--------+---------------------------------------+

At this point, all is all right. But now I want to "flatten" the arrays: create one column per element of "langages", joining each pair of languages with "/", in order to export the result as CSV. It might look like this:

firstname   lastname    langage_1    langage_2    langage_3 
john        smith       Java/Python  C/R          Perl/Scala
robert      plant       C/Java       Python/Perl

I tried to create 3 columns like this:

from pyspark.sql.functions import col

df.select([(col("langages")[x]).alias("langage_" + str(x + 1)) for x in range(0, 3)]).show()
+--------------+--------------+-------------+
|     langage_1|     langage_2|    langage_3|
+--------------+--------------+-------------+
|[Java, Python]|        [C, R]|[Perl, Scala]|
|     [C, Java]|[Python, Perl]|         null|
+--------------+--------------+-------------+

My problem is that the langages array may sometimes have 2, 3, 4, or any number of elements, so range(0, 3) might need to be range(0, 4)! I need to find the maximum number of elements across the arrays.

I also don't know how to concatenate the elements of each sub-array, e.g. turning [Java, Python] into "Java/Python".

Thanks for your help

1 Answer

First we compute the maximum number of columns needed, then create the columns from that value. Note that each generated column needs a distinct alias ("langage_1", "langage_2", ...), otherwise every column ends up named "langages":

import pyspark.sql.functions as F

n = df.select(F.max(F.size("langages")).alias("n_columns")).first().n_columns

df.select(
    "firstname",
    "lastname",
    *[F.concat_ws("/", F.array(
        F.col("langages").getItem(i).getItem("lang1"),
        F.col("langages").getItem(i).getItem("lang2"),
    )).alias("langage_" + str(i + 1)) for i in range(n)]
).show(truncate=False)

2 Comments

Thanks, but that's not really what I want... you build only one column with all langages, but I want as many columns as there are sub-arrays in the array: for the first row, a first column with "Java/Python", a second column with "C/R", and a third column with "Perl/Scala"; for the second row, a column with "C/Java", a second column with "Python/Perl", and an empty third column, because there are only 2 entries for this row.
Wow, that's great, it works like a charm! Now I'll look at this in detail in order to understand the code deeply. Thanks!
