
I would like to turn the distinct values of one column into separate columns of a DataFrame in PySpark on Databricks.

For example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.sparkContext.parallelize([["dapd", "shop", "retail"],
    ["dapd", "shop", "on-line"],
    ["dapd", "payment", "credit"],
    ["wrfr", "shop", "supermarket"],
    ["wrfr", "shop", "brand store"],
    ["wrfr", "payment", "cash"]]).toDF(["id", "value1", "value2"])

I need to transform it to:

id      shop                       payment
dapd    retail|on-line             credit
wrfr    supermarket|brand store    cash

I am not sure how to do this in PySpark.

Thanks,

  • I’m having trouble understanding this, can you explain it differently? Commented Nov 21, 2019 at 5:49

2 Answers


What you're looking for is a combination of pivot and an aggregation function such as collect_list() or collect_set(). Have a look at the available aggregation functions here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=agg#module-pyspark.sql.functions. Here is a code example:

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

# Reuse the active session (Databricks provides one) or create a new one
spark = SparkSession.builder.getOrCreate()

df = spark.sparkContext.parallelize([
    ["dapd", "shop", "retail"],
    ["dapd", "shop", "on-line"],
    ["dapd", "payment", "credit"],
    ["wrfr", "shop", "supermarket"],
    ["wrfr", "shop", "brand store"],
    ["wrfr", "payment", "cash"]]
).toDF(["id", "value1", "value2"])

df.show()
+----+-------+-----------+
|  id| value1|     value2|
+----+-------+-----------+
|dapd|   shop|     retail|
|dapd|   shop|    on-line|
|dapd|payment|     credit|
|wrfr|   shop|supermarket|
|wrfr|   shop|brand store|
|wrfr|payment|       cash|
+----+-------+-----------+


df.groupBy('id').pivot('value1').agg(f.collect_list("value2")).show(truncate=False)
+----+--------+--------------------------+
|id  |payment |shop                      |
+----+--------+--------------------------+
|dapd|[credit]|[retail, on-line]         |
|wrfr|[cash]  |[supermarket, brand store]|
+----+--------+--------------------------+


You can do something like this, which works when each id has at most two shop values and a single payment value:

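import pyspark.sql.functions as func
# Pivot value1 into columns; each cell holds the list of matching value2 entries
newdf = df.groupby('id').pivot('value1').agg(func.collect_list(func.col('value2')))
# Join the first two shop entries with '|'
newdf = newdf.withColumn('shop', func.concat_ws('|', func.col('shop')[0], func.col('shop')[1]))
# Keep only the first payment entry
newdf = newdf.withColumn('payment', func.col('payment')[0])
newdf.show(20, False)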
+----+-------+-----------------------+
|id  |payment|shop                   |
+----+-------+-----------------------+
|dapd|credit |retail|on-line         |
|wrfr|cash   |brand store|supermarket|
+----+-------+-----------------------+
