
I was trying to print the total number of elements in each partition of a DataFrame using Spark 2.2:

from pyspark.sql.functions import *
from pyspark.sql import SparkSession

def count_elements(splitIndex, iterator):
    n = sum(1 for _ in iterator)
    yield (splitIndex, n)

spark = SparkSession.builder.appName("tmp").getOrCreate()
num_parts = 3
df = spark.read.json("/tmp/tmp/gon_s.json").repartition(num_parts)
print("df has partitions."+ str(df.rdd.getNumPartitions()))
print("Elements across partitions is:" + str(df.rdd.mapPartitionsWithIndex(lambda ind, x: count_elements(ind, x)).take(3)))

The code above kept failing with the following error:

  n = sum(1 for _ in iterator)
  File "/home/dev/wk/pyenv/py3/lib/python3.5/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/functions.py", line 40, in _
    jc = getattr(sc._jvm.functions, name)(col._jc if isinstance(col, Column) else col)
AttributeError: 'NoneType' object has no attribute '_jvm'

After removing the import below,

from pyspark.sql.functions import *

the code works fine:

df has partitions.3
Elements across partitions is:[(0, 1), (1, 2), (2, 2)]

What is causing this error, and how can I fix it?

  • Don't do import * as it can mess up your namespace. Do import pyspark.sql.functions as f and call the functions from that module using f.function_name(). I'm pretty sure you meant to call the builtin sum() and not pyspark.sql.functions.sum(). That's probably what's causing your issue. Commented Mar 26, 2018 at 14:16
  • Thanks for pointing that out, @pault, very helpful. I would have expected the builtin sum to take precedence over the pyspark.sql.functions.sum() method! Commented Mar 26, 2018 at 17:33
  • I think @pault's comment should be posted as an answer. Commented Apr 16, 2019 at 12:44
  • Possible duplicate of pyspark Column is not iterable. Commented Apr 10, 2020 at 12:52

1 Answer


This is a great example of why you shouldn't use import *.

The line

from pyspark.sql.functions import *

will bring all of the functions in the pyspark.sql.functions module into your namespace, including some that shadow your builtins.

The specific issue is in the count_elements function on the line:

n = sum(1 for _ in iterator)
#   ^^^ - this is now pyspark.sql.functions.sum

You intended to call the builtin sum, but the import * shadowed it with pyspark.sql.functions.sum, which expects a Column, not a generator.
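The shadowing mechanism can be seen without Spark at all; here is a minimal sketch where a local sum definition stands in for the name pyspark.sql.functions.sum would bind after a wildcard import:

```python
import builtins

# Hypothetical stand-in: after `from pyspark.sql.functions import *`,
# the name `sum` no longer refers to the builtin.
def sum(col):
    return "pyspark would return a Column here, not a number"

print(sum(1 for _ in range(5)))           # the shadowed name wins
print(builtins.sum(1 for _ in range(5)))  # 5, the real builtin is always reachable
```

The builtins module always exposes the original names, which is one way to defend a function against wildcard imports elsewhere in the file.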

Instead, do one of the following:

import pyspark.sql.functions as f

and call the functions with the module prefix, e.g. f.sum(...). Or import only what you need under an alias:

from pyspark.sql.functions import sum as sum_
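With either import style, the helper from the question can also call the builtin explicitly. A minimal sketch, checked against a plain iterator since the function itself needs no Spark session (in Spark you would pass it to df.rdd.mapPartitionsWithIndex as in the question):

```python
import builtins

def count_elements(split_index, iterator):
    # builtins.sum cannot be shadowed by a wildcard import
    n = builtins.sum(1 for _ in iterator)
    yield (split_index, n)

# quick sanity check on an ordinary iterator
print(list(count_elements(0, iter([10, 20, 30]))))  # [(0, 3)]
```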

1 Comment

This is a legit answer. I'm not closing this question. Can you remove your close vote, please? :)
