
I am trying to concatenate all the values in a column into a single string of comma-separated values. In Scala, I did that with the following code:

val pushLogIds = incLogIdDf.select($"interface_log_id").collect().map(_.getInt(0).toString).mkString(",")

I am new to Python, and after selecting the values in the column I am unable to figure out how to concatenate them into a string after collecting them:

final_log_id_list = logidf.select("interface_log_id").collect()

Ex:

interface_log_id
----------------
     1
     2
     3
     4

Expected output: a string variable containing '1,2,3,4'

Could anyone let me know how to concatenate all the column values of a DataFrame into a single string of comma-separated values?

  • I had imported pyspark.sql.functions as F so that the Python builtins such as min, max, etc. are not overridden; hence every PySpark builtin needs an F prefix for me. You can ignore the F if you import without an alias. Commented Apr 17, 2020 at 7:20
  • Got it. One last thing: this still leaves the output in a DataFrame column. To convert it to a string, I did a = str(df.select('value').agg(F.concat_ws(",", F.collect_list(F.col('value'))))), and a is 'DataFrame[concat_ws(,, collect_list(value)): string]'; it still doesn't yield a string and instead comes back as a DataFrame. Commented Apr 17, 2020 at 8:36
  • For a scalar you can do df.agg(F.concat_ws(",", F.collect_list(F.col("A"))).alias('A')).first()[0] Commented Apr 17, 2020 at 8:53
  • Can you post this as an answer? Commented Apr 17, 2020 at 8:58
  • Posted. The linked answer is similar, but I can't say it's an exact dupe. Commented Apr 17, 2020 at 9:05
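
To make the fix from this comment thread concrete, here is a minimal self-contained sketch (my consolidation, assuming an active SparkSession obtained via getOrCreate; variable names are illustrative) that rebuilds the example column from the question and extracts the comma-separated string:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Recreate the example column interface_log_id with the values 1..4
df = spark.createDataFrame([(1,), (2,), (3,), (4,)], ["interface_log_id"])

# collect_list gathers the values into an array, concat_ws joins them with ",",
# and first()[0] pulls the single aggregated value out as a plain Python string
log_ids = df.agg(F.concat_ws(",", F.collect_list(F.col("interface_log_id")))).first()[0]

print(log_ids)  # 1,2,3,4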

1 Answer


To convert a column to a single string, you can first collect the column values into a list using collect_list, concatenate them with "," using concat_ws, and finally pull the value out as a scalar using first():

import pyspark.sql.functions as F

df.agg(F.concat_ws(",", F.collect_list(F.col("interface_log_id")))).first()[0]
#'1,2,3,4'

Another way is collect_list followed by Python's ','.join, using map(str, ...) since the column is numeric:

','.join(map(str,df.agg(F.collect_list(F.col("A"))).first()[0]))
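
The practical difference is where the join happens: concat_ws builds the comma-separated string inside Spark and returns a single row, while ','.join collects the whole list back to the driver and joins it in Python. The benchmarks below suggest both are dominated by the aggregation itself.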

Adding benchmarks:

%timeit ','.join(map(str,df.agg(F.collect_list(F.col("A"))).first()[0]))
#9.38 s ± 133 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df.agg(F.concat_ws(",",F.collect_list(F.col("A")))).first()[0]
#9.46 s ± 246 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Comments

This works and is what I am trying to achieve. I believe [0] is the column index?
@Metadata yes, first() gives you the Row object, and [0] gives you the first value of the row.
@Metadata added another way; you can check the performance and use either.
^ For Spark 2.4+, instead of concat_ws I'd say use array_join.
array_join works too: %timeit df.select(F.array_join(F.collect_list(F.col("A").cast("string")),",")).first()[0] gives 9.87 s ± 253 ms per loop (mean ± std. dev. of 7 runs, 1 loop each).
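
For reference, a standalone sketch of that array_join variant (my restatement, assuming Spark 2.4+ and the question's df and F alias; note the cast, since array_join expects an array of strings):

# array_join (Spark 2.4+) joins an array column with a separator;
# collect_list yields array<int> here, so cast the column to string first
log_ids = df.select(F.array_join(F.collect_list(F.col("interface_log_id").cast("string")), ",")).first()[0]
# '1,2,3,4'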