I need to iterate over a DataFrame using PySpark, just like we can iterate over a set of values using a for loop. Below is the code I have written. The problems with this code are:
- I have to use collect, which breaks the parallelism.
- I am not able to print any values from the DataFrame inside the function funcRowIter.
- I cannot break the loop once a match is found.
I have to do this in PySpark and cannot use pandas:
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
hive_context = HiveContext(sc)

# Load the Hive table and register it so it can be queried by name.
tab = hive_context.sql("select * from update_poc.test_table_a")
tab.registerTempTable("tab")
print(type(tab))

# Drop down to the underlying RDD so each row can be passed to a function.
df = tab.rdd

def funcRowIter(rows):
    print(type(rows))          # runs on the executors, not the driver
    if rows.id == "1":
        return 1

df_1 = df.map(funcRowIter).collect()
print(df_1)
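
For context, a common way to express this kind of lookup without collecting the whole table is to push the condition into Spark itself: filter stays parallel across the executors, and take(1) returns as soon as one matching row has been produced, which stands in for breaking the loop. A minimal sketch, assuming the same tab DataFrame and id column as above:

from pyspark.sql.functions import col

# filter() runs in parallel on the executors; take(1) stops the job as soon
# as one matching row is available, so the full table is never collected.
matches = tab.filter(col("id") == "1").take(1)   # a list with at most one Row

if matches:
    print(matches[0])   # executes on the driver, so the output is visible
else:
    print("no match found")

Note also that print calls inside funcRowIter execute on the executors, so their output goes to the executor logs rather than the driver console; that is why nothing appears to be printed when the map runs.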