
I was trying to print the total number of elements in each partition of a DataFrame using Spark 2.2:

from pyspark.sql.functions import *
from pyspark.sql import SparkSession

def count_elements(splitIndex, iterator):
    n = sum(1 for _ in iterator)
    yield (splitIndex, n)

spark = SparkSession.builder.appName("tmp").getOrCreate()
num_parts = 3
df = spark.read.json("/tmp/tmp/gon_s.json").repartition(num_parts)
print("df has partitions."+ str(df.rdd.getNumPartitions()))
print("Elements across partitions is:" + str(df.rdd.mapPartitionsWithIndex(lambda ind, x: count_elements(ind, x)).take(3)))

The code above kept failing with the following error:

  n = sum(1 for _ in iterator)
  File "/home/dev/wk/pyenv/py3/lib/python3.5/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/functions.py", line 40, in _
    jc = getattr(sc._jvm.functions, name)(col._jc if isinstance(col, Column) else col)
AttributeError: 'NoneType' object has no attribute '_jvm'

After removing the import below,

from pyspark.sql.functions import *

the code works fine:

df has partitions.3
Elements across partitions is:[(0, 1), (1, 2), (2, 2)]

What is causing this error, and how can I fix it?

  • Don't do import * as it can mess up your namespace. Do import pyspark.sql.functions as f and call the functions from that module using f.function_name(). I'm pretty sure you meant to call the builtin sum() and not pyspark.sql.functions.sum(). That's probably what's causing your issue. Commented Mar 26, 2018 at 14:16
  • Thanks for pointing that out, @pault, very helpful. I would have expected the builtin sum to take precedence over the pyspark.sql.functions.sum() method! Commented Mar 26, 2018 at 17:33
  • I think @pault's comment should be posted as an answer. Commented Apr 16, 2019 at 12:44
  • Possible duplicate of pyspark Column is not iterable. Commented Apr 10, 2020 at 12:52

1 Answer


This is a great example of why you shouldn't use import *.

The line

from pyspark.sql.functions import *

will bring all of the functions in the pyspark.sql.functions module into your namespace, including some that shadow your builtins.

The specific issue is in the count_elements function on the line:

n = sum(1 for _ in iterator)
#   ^^^ - this is now pyspark.sql.functions.sum

You intended to call the builtin sum, but the import * shadowed it with pyspark.sql.functions.sum, which expects a Column, not a generator.
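The shadowing mechanism can be seen without Spark at all; here is a minimal sketch where a local sum definition stands in for the name pyspark.sql.functions.sum would bind after a wildcard import:

```python
import builtins

# Hypothetical stand-in: after `from pyspark.sql.functions import *`,
# the name `sum` no longer refers to the builtin.
def sum(col):
    return "pyspark would return a Column here, not a number"

print(sum(1 for _ in range(5)))           # the shadowed name wins
print(builtins.sum(1 for _ in range(5)))  # 5, the real builtin is always reachable
```

The builtins module always exposes the original names, which is one way to defend a function against wildcard imports elsewhere in the file.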

Instead, do one of the following:

import pyspark.sql.functions as f

and call the functions with the module prefix, e.g. f.sum(...). Or import only what you need under an alias:

from pyspark.sql.functions import sum as sum_
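With either import style, the helper from the question can also call the builtin explicitly. A minimal sketch, checked against a plain iterator since the function itself needs no Spark session (in Spark you would pass it to df.rdd.mapPartitionsWithIndex as in the question):

```python
import builtins

def count_elements(split_index, iterator):
    # builtins.sum cannot be shadowed by a wildcard import
    n = builtins.sum(1 for _ in iterator)
    yield (split_index, n)

# quick sanity check on an ordinary iterator
print(list(count_elements(0, iter([10, 20, 30]))))  # [(0, 3)]
```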

1 Comment

This is a legit answer. I'm not closing this question. Can you remove your close vote, please? :)
