I have existing logic that converts a pandas dataframe to a list of tuples:
list(zip(*[df[c].values.tolist() for c in df]))
where df is a pandas dataframe.
Could somebody please help me implement the same logic in PySpark, without pandas?
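For context, here is a minimal sketch of what that pandas expression produces on a small hypothetical frame (toy data chosen to match the example further down):

import pandas as pd

df = pd.DataFrame({'Name': ['name1', 'name2'], 'Score': [11.23, 14.57]})
list(zip(*[df[c].values.tolist() for c in df]))
# [('name1', 11.23), ('name2', 14.57)]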
You can first convert the DataFrame to an RDD via its rdd attribute.
A Row in a DataFrame is itself a tuple subclass, so you can just:
rdd = df.rdd            # RDD of Row objects
b = rdd.map(tuple)      # turn each Row into a plain tuple
b.collect()             # bring the tuples back to the driver as a list
Example DF:
df.show()
+-----+-----+
| Name|Score|
+-----+-----+
|name1|11.23|
|name2|14.57|
|name3| 2.21|
|name4| 8.76|
|name5|18.71|
+-----+-----+
After b.collect()
[('name1', 11.23), ('name2', 14.57), ('name3', 2.21), ('name4', 8.76), ('name5', 18.71)]
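For completeness, a minimal sketch that skips the RDD entirely: collect() already returns Row objects, which convert cleanly to tuples (same two-column DF assumed):

tuples = [tuple(row) for row in df.collect()]
# [('name1', 11.23), ('name2', 14.57), ...]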
If you're going to loop over this list of tuples, collect() works, but toLocalIterator() is the better choice: it streams the rows to the driver one partition at a time instead of materializing everything at once.
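A minimal sketch of that lazy iteration, assuming the same two-column DF as above:

for name, score in df.rdd.map(tuple).toLocalIterator():
    # process one row at a time without holding the full result in driver memory
    print(name, score)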
An alternative without collect(), using collect_list instead:
import pyspark.sql.functions as F
df.show()
+-----+-----+
| Name|Score|
+-----+-----+
|name1|11.23|
|name2|14.57|
|name3| 2.21|
|name4| 8.76|
|name5|18.71|
+-----+-----+
@F.udf
def combo(*args):
    # args[0] is the array('Name', 'Score') value for the current row
    return args[0]

df.withColumn('Combo', combo(F.array('Name', 'Score'))).agg(F.collect_list('Combo')).show(truncate=False)
+--------------------------------------------------------------------------+
|collect_list(Combo) |
+--------------------------------------------------------------------------+
|[[name1, 11.23],[name2, 14.57],[name3, 2.21],[name4, 8.76],[name5, 18.71]]|
+--------------------------------------------------------------------------+
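The udf isn't strictly required for this; a sketch of the same idea using struct (column names assumed from the example above), which also gets you Python tuples back on the driver:

import pyspark.sql.functions as F

rows = df.agg(F.collect_list(F.struct('Name', 'Score'))).first()[0]
tuples = [tuple(r) for r in rows]
# [('name1', 11.23), ('name2', 14.57), ...]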