
I am trying to concatenate all the values in a column into a single string of comma-separated values. In Scala, I did that with the following code:

val pushLogIds = incLogIdDf.select($"interface_log_id").collect().map(_.getInt(0).toString).mkString(",")

I am new to Python, and after selecting the values in the column I am unable to figure out how to concatenate them into a string after collecting them:

final_log_id_list = logidf.select("interface_log_id").collect()

Ex:

interface_log_id
----------------
     1
     2
     3
     4

Expected output: a string variable containing '1,2,3,4'

Could anyone let me know how to concatenate all the column values of a DataFrame into a single string of comma-separated values?

  • I had imported pyspark.sql.functions as F so that the Python builtins such as min, max, etc. are not overridden; hence every PySpark builtin needs an F prefix for me. You can ignore the F if you import without an alias. Commented Apr 17, 2020 at 7:20
  • Got it. One last thing: this still leaves the output in a DataFrame column. To convert it to a string, I did a = str(df.select('value').agg(F.concat_ws(",", F.collect_list(F.col('value'))))), and a is 'DataFrame[concat_ws(,, collect_list(value)): string]'; it still doesn't yield a string and instead comes back as a DataFrame. Commented Apr 17, 2020 at 8:36
  • For a scalar you can do df.agg(F.concat_ws(",", F.collect_list(F.col("A"))).alias('A')).first()[0] Commented Apr 17, 2020 at 8:53
  • Can you post this as an answer? Commented Apr 17, 2020 at 8:58
  • Posted. The linked answer is similar, but I can't say it's an exact dupe. Commented Apr 17, 2020 at 9:05
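
To make the fix from this comment thread concrete, here is a minimal self-contained sketch (my consolidation, assuming an active SparkSession obtained via getOrCreate; variable names are illustrative) that rebuilds the example column from the question and extracts the comma-separated string:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Recreate the example column interface_log_id with the values 1..4
df = spark.createDataFrame([(1,), (2,), (3,), (4,)], ["interface_log_id"])

# collect_list gathers the values into an array, concat_ws joins them with ",",
# and first()[0] pulls the single aggregated value out as a plain Python string
log_ids = df.agg(F.concat_ws(",", F.collect_list(F.col("interface_log_id")))).first()[0]

print(log_ids)  # 1,2,3,4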

1 Answer


To convert a column to a single string, you can first collect the column values into a list using collect_list, concatenate them with "," using concat_ws, and finally pull the value out as a scalar using first():

import pyspark.sql.functions as F

df.agg(F.concat_ws(",", F.collect_list(F.col("interface_log_id")))).first()[0]
#'1,2,3,4'

Another way is collect_list followed by Python's ','.join, using map(str, ...) since the column is numeric:

','.join(map(str,df.agg(F.collect_list(F.col("A"))).first()[0]))
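
The practical difference is where the join happens: concat_ws builds the comma-separated string inside Spark and returns a single row, while ','.join collects the whole list back to the driver and joins it in Python. The benchmarks below suggest both are dominated by the aggregation itself.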

Adding benchmarks:

%timeit ','.join(map(str,df.agg(F.collect_list(F.col("A"))).first()[0]))
#9.38 s ± 133 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df.agg(F.concat_ws(",",F.collect_list(F.col("A")))).first()[0]
#9.46 s ± 246 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Comments

This works and is what I am trying to achieve. I believe [0] is the column index?
@Metadata yes, first() gives you the Row object, and [0] gives you the first value of the row.
@Metadata added another way; you can check the performance and use either.
^ For Spark 2.4+, instead of concat_ws I'd say use array_join.
array_join works too: %timeit df.select(F.array_join(F.collect_list(F.col("A").cast("string")),",")).first()[0] gives 9.87 s ± 253 ms per loop (mean ± std. dev. of 7 runs, 1 loop each).
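
For reference, a standalone sketch of that array_join variant (my restatement, assuming Spark 2.4+ and the question's df and F alias; note the cast, since array_join expects an array of strings):

# array_join (Spark 2.4+) joins an array column with a separator;
# collect_list yields array<int> here, so cast the column to string first
log_ids = df.select(F.array_join(F.collect_list(F.col("interface_log_id").cast("string")), ",")).first()[0]
# '1,2,3,4'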