I need to iterate over a DataFrame using PySpark, just like we can iterate over a set of values using a for loop. Below is the code I have written. The problems with this code are:
- I have to use collect, which breaks the parallelism.
- I am not able to print any values from the DataFrame inside the function funcRowIter.
- I cannot break the loop once a match is found.
I have to do this in PySpark and cannot use pandas:
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
hive_context = HiveContext(sc)

# Load the Hive table and register it so it can be queried by name.
tab = hive_context.sql("select * from update_poc.test_table_a")
tab.registerTempTable("tab")
print(type(tab))

# Drop down to the underlying RDD so each row can be passed to a function.
df = tab.rdd

def funcRowIter(rows):
    print(type(rows))          # runs on the executors, not the driver
    if rows.id == "1":
        return 1

df_1 = df.map(funcRowIter).collect()
print(df_1)
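
For context, a common way to express this kind of lookup without collecting the whole table is to push the condition into Spark itself: filter stays parallel across the executors, and take(1) returns as soon as one matching row has been produced, which stands in for breaking the loop. A minimal sketch, assuming the same tab DataFrame and id column as above:

from pyspark.sql.functions import col

# filter() runs in parallel on the executors; take(1) stops the job as soon
# as one matching row is available, so the full table is never collected.
matches = tab.filter(col("id") == "1").take(1)   # a list with at most one Row

if matches:
    print(matches[0])   # executes on the driver, so the output is visible
else:
    print("no match found")

Note also that print calls inside funcRowIter execute on the executors, so their output goes to the executor logs rather than the driver console; that is why nothing appears to be printed when the map runs.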