3

I need to iterate over a dataframe using pySpark just like we can iterate a set of values using for loop. Below is the code I have written. The problem with this code is

  1. I have to use collect which breaks the parallelism
  2. I am not able to print any values from the DataFrame in the function funcRowIter
  3. I cannot break the loop once I have the match found.

I have to do it in pySpark and cannot use pandas for this :

from pyspark.sql.functions import *
from pyspark.sql import HiveContext
from pyspark.sql import functions
from pyspark.sql import DataFrameWriter
from pyspark.sql.readwriter import DataFrameWriter
from pyspark import SparkContext

sc = SparkContext()
hive_context = HiveContext(sc)

tab = hive_context.sql("select * from update_poc.test_table_a")

tab.registerTempTable("tab")
print type(tab)

df = tab.rdd

def funcRowIter(rows):
    print type(rows)
        if(rows.id == "1"):
            return 1

df_1 = df.map(funcRowIter).collect()
print df_1

2 Answers 2

4

Instead of using df_1 = df.map(funcRowIter).collect() you should try UDF. Hope this will help.

from pyspark.sql.functions import struct
from pyspark.sql.functions import *
def funcRowIter(rows):
    print type(rows)
    if(row is nor None and row.id is not None)
        if(rows.id == "1"):
            return 1
A = udf(funcRowIter, ArrayType(StringType()))
z = df.withColumn(data_id, A(struct([df[x] for x in df.columns])))
z.show()

collect() will never be the good options for very big data i.e millions of records

Sign up to request clarification or add additional context in comments.

Comments

0

Seems that your goal is to display a specific row. You could use.filter then a .collect.

For instance,

row_1 = rdd.filter(lambda x: x.id==1).collect()

However, it won't be efficient to try iterate over your dataframe this way.

1 Comment

I am trying to do this without collect() coz collect will break the parallelism

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.