
I have a pyspark dataframe that has fields: "id", "fields_0_type" , "fields_0_price", "fields_1_type", "fields_1_price"

+----+-------------+--------------+-------------+--------------+
|id  |fields_0_type|fields_0_price|fields_1_type|fields_1_price|
+----+-------------+--------------+-------------+--------------+
|1234|Return       |45            |New          |50            |
+----+-------------+--------------+-------------+--------------+

How can I combine these values into two columns called "type" and "price", with the values joined by ","? The ideal dataframe looks like this:

+----+----------+-----+
|id  |type      |price|
+----+----------+-----+
|1234|Return,New|45,50|
+----+----------+-----+

Note that the data I am providing here is a sample. In reality I have more than just the "type" and "price" columns that will need to be combined.

Update:

Thanks, it works. But is there any way I can get rid of the extra ","? They are caused by blank values in the columns. Is there a way to skip the columns with blank values? What it shows now:

+-------------------------------------------------------------------+
|type                                                               |
+-------------------------------------------------------------------+
|New,New,New,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,         |
|New,New,Sale,Sale,New,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, |
+-------------------------------------------------------------------+

How I want it:

+---------------------+
|type                 |
+---------------------+
|New,New,New          |
|New,New,Sale,Sale,New|
+---------------------+
  • Replace the blanks with lit(None), something like df = df.select(*[when(col(c) == "", lit(None)).otherwise(col(c)) for c in cols]) before using concat_ws, OR do a regexp_replace to remove the trailing commas. Commented Aug 19, 2020 at 19:29
  • What does "cols" refer to here? Commented Aug 19, 2020 at 20:20
  • cols should be df.columns — that was a typo. Commented Aug 19, 2020 at 23:10

1 Answer

Put all the columns into an array, then use the concat_ws function.

Example:

df.show()
#+----+-------------+-------------+-------------+
#|  id|fields_0_type|fields_1_type|fields_2_type|
#+----+-------------+-------------+-------------+
#|1234|            a|            b|            c|
#+----+-------------+-------------+-------------+

from pyspark.sql.functions import array, concat_ws

columns = df.columns
columns.remove('id')

df.withColumn("type", concat_ws(",", array(*columns))).drop(*columns).show()
#+----+-----+
#|  id| type|
#+----+-----+
#|1234|a,b,c|
#+----+-----+

UPDATE:

df.show()
#+----+-------------+--------------+-------------+--------------+
#|  id|fields_0_type|fields_0_price|fields_1_type|fields_1_price|
#+----+-------------+--------------+-------------+--------------+
#|1234|            a|            45|            b|            50|
#+----+-------------+--------------+-------------+--------------+

type_cols=[f for f in df.columns if 'type' in f]
price_cols=[f for f in df.columns if 'price' in f]

df.withColumn("type", concat_ws(",", array(*type_cols))).\
    withColumn("price", concat_ws(",", array(*price_cols))).\
    drop(*type_cols, *price_cols).\
    show()
#+----+----+-----+
#|  id|type|price|
#+----+----+-----+
#|1234| a,b|45,50|
#+----+----+-----+

1 Comment

Hi, an update was made to the question.
