I have existing logic that converts a pandas dataframe to a list of tuples:
list(zip(*[df[c].values.tolist() for c in df]))
where df is a pandas dataframe.
Could somebody please help me implement the same logic in PySpark, without pandas?
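For context, here is a minimal sketch of what that pandas expression produces on a small hypothetical frame (toy data chosen to match the example further down):

import pandas as pd

df = pd.DataFrame({'Name': ['name1', 'name2'], 'Score': [11.23, 14.57]})
list(zip(*[df[c].values.tolist() for c in df]))
# [('name1', 11.23), ('name2', 14.57)]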
You can first convert the DataFrame to an RDD via its rdd attribute.
A Row in a DataFrame is itself a tuple subclass, so you can just:
rdd = df.rdd            # RDD of Row objects
b = rdd.map(tuple)      # turn each Row into a plain tuple
b.collect()             # bring the tuples back to the driver as a list
Example DF:
df.show()
+-----+-----+
| Name|Score|
+-----+-----+
|name1|11.23|
|name2|14.57|
|name3| 2.21|
|name4| 8.76|
|name5|18.71|
+-----+-----+
After b.collect()
[('name1', 11.23), ('name2', 14.57), ('name3', 2.21), ('name4', 8.76), ('name5', 18.71)]
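For completeness, a minimal sketch that skips the RDD entirely: collect() already returns Row objects, which convert cleanly to tuples (same two-column DF assumed):

tuples = [tuple(row) for row in df.collect()]
# [('name1', 11.23), ('name2', 14.57), ...]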
If you're going to loop over this list of tuples, collect() works, but toLocalIterator() is the better choice: it streams the rows to the driver one partition at a time instead of materializing everything at once.
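A minimal sketch of that lazy iteration, assuming the same two-column DF as above:

for name, score in df.rdd.map(tuple).toLocalIterator():
    # process one row at a time without holding the full result in driver memory
    print(name, score)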
An alternative without collect(), using collect_list instead:
import pyspark.sql.functions as F
df.show()
+-----+-----+
| Name|Score|
+-----+-----+
|name1|11.23|
|name2|14.57|
|name3| 2.21|
|name4| 8.76|
|name5|18.71|
+-----+-----+
@F.udf
def combo(*args):
    # args[0] is the array('Name', 'Score') value for the current row
    return args[0]

df.withColumn('Combo', combo(F.array('Name', 'Score'))).agg(F.collect_list('Combo')).show(truncate=False)
+--------------------------------------------------------------------------+
|collect_list(Combo) |
+--------------------------------------------------------------------------+
|[[name1, 11.23],[name2, 14.57],[name3, 2.21],[name4, 8.76],[name5, 18.71]]|
+--------------------------------------------------------------------------+
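The udf isn't strictly required for this; a sketch of the same idea using struct (column names assumed from the example above), which also gets you Python tuples back on the driver:

import pyspark.sql.functions as F

rows = df.agg(F.collect_list(F.struct('Name', 'Score'))).first()[0]
tuples = [tuple(r) for r in rows]
# [('name1', 11.23), ('name2', 14.57), ...]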