
I have a pyspark dataframe that has fields: "id", "fields_0_type" , "fields_0_price", "fields_1_type", "fields_1_price"

+----+-------------+--------------+-------------+--------------+
|id  |fields_0_type|fields_0_price|fields_1_type|fields_1_price|
+----+-------------+--------------+-------------+--------------+
|1234|Return       |45            |New          |50            |
+----+-------------+--------------+-------------+--------------+

How can I combine these values into two columns called "type" and "price", with the values joined by ","? The ideal dataframe looks like this:

+----+----------+-----+
|id  |type      |price|
+----+----------+-----+
|1234|Return,New|45,50|
+----+----------+-----+

Note that the data I am providing here is a sample. In reality I have more than just the "type" and "price" columns that will need to be combined.

Update:

Thanks, it works. But is there any way I can get rid of the extra ","? They are caused by blank values in the columns. Is there a way to skip the columns with blank values? What it shows now:

+-------------------------------------------------------------------+
|type                                                               |
+-------------------------------------------------------------------+
|New,New,New,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,         |
|New,New,Sale,Sale,New,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, |
+-------------------------------------------------------------------+

How I want it:

+---------------------+
|type                 |
+---------------------+
|New,New,New          |
|New,New,Sale,Sale,New|
+---------------------+
  • Replace the blanks with lit(None), something like df = df.select(*[when(col(c) == "", lit(None)).otherwise(col(c)) for c in cols]) before using concat_ws, OR do a regexp_replace to remove the trailing commas. Commented Aug 19, 2020 at 19:29
  • What does "cols" refer to here? Commented Aug 19, 2020 at 20:20
  • cols should be df.columns — that was a typo. Commented Aug 19, 2020 at 23:10

1 Answer

Put all the columns into an array, then use the concat_ws function.

Example:

df.show()
#+----+-------------+-------------+-------------+
#|  id|fields_0_type|fields_1_type|fields_2_type|
#+----+-------------+-------------+-------------+
#|1234|            a|            b|            c|
#+----+-------------+-------------+-------------+

from pyspark.sql.functions import array, concat_ws

columns = df.columns
columns.remove('id')

df.withColumn("type", concat_ws(",", array(*columns))).drop(*columns).show()
#+----+-----+
#|  id| type|
#+----+-----+
#|1234|a,b,c|
#+----+-----+

UPDATE:

df.show()
#+----+-------------+--------------+-------------+--------------+
#|  id|fields_0_type|fields_0_price|fields_1_type|fields_1_price|
#+----+-------------+--------------+-------------+--------------+
#|1234|            a|            45|            b|            50|
#+----+-------------+--------------+-------------+--------------+

type_cols=[f for f in df.columns if 'type' in f]
price_cols=[f for f in df.columns if 'price' in f]

df.withColumn("type", concat_ws(",", array(*type_cols))).\
    withColumn("price", concat_ws(",", array(*price_cols))).\
    drop(*type_cols, *price_cols).\
    show()
#+----+----+-----+
#|  id|type|price|
#+----+----+-----+
#|1234| a,b|45,50|
#+----+----+-----+

1 Comment

Hi, an update was made to the question.
