
I wrote some very simple Spark code in Python:

import collections
Person = collections.namedtuple('Person', ['name', 'age', 'gender'])

a = sc.parallelize([['Barack Obama', 54, 'M'], ['Joe Biden', 74, 'M']])
a = a.map(lambda row: Person(*row))

print(a.collect())

def func(row):
    tmp = row._replace(name='Jack Rabbit')
    return tmp

print(a.map(func).collect())

I get the following output and error:

[Person(name='Barack Obama', age=54, gender='M'), Person(name='Joe Biden', age=74, gender='M')]

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 49 in stage 11.0 failed 4 times, most recent failure: Lost task 49.3 in stage 11.0 (TID 618, 172.19.75.121): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/etc/spark-1.4.0-bin-cdh4/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/etc/spark-1.4.0-bin-cdh4/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/etc/spark-1.4.0-bin-cdh4/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "<ipython-input-19-f0b4885784cb>", line 2, in func
  File "<string>", line 32, in _replace
RuntimeError: uninitialized classmethod object

    at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:138)
    at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:179)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:97)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

However, if I run the following piece of code, I don't get any error:

for row in a.collect():
  func(row)

What gives?

1 Answer


Edit:

Support for serializing named tuples was introduced with SPARK-10542.
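
On a Spark version that includes that fix, the approach from the question is expected to work as-is; a minimal sketch of that path, assuming such a version and an existing SparkContext sc:

import collections

Person = collections.namedtuple('Person', ['name', 'age', 'gender'])

# With the fix in place, the namedtuple class defined on the driver
# can be used inside closures shipped to the workers.
people = sc.parallelize([['Barack Obama', 54, 'M'], ['Joe Biden', 74, 'M']]).map(lambda row: Person(*row))
print(people.map(lambda p: p._replace(name='Jack Rabbit')).collect())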

Original answer

Why doesn't it work? Because the namedtuple call creates a class, and classes in Spark are not serialized as part of the closure. This means you have to create a separate module* and make sure it is available on the workers:

txt = "\n".join(["import collections",
    "Person = collections.namedtuple('Person', ['name', 'age', 'gender'])"])

with open("persons.py", "w") as fw:
    fw.write(txt)

sc.addPyFile("persons.py")  # Ship module to the worker nodes

Next you can simply import it, and everything should work as expected:

import persons

a.map(func).collect()
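
For reference, a slightly fuller sketch of the end-to-end flow; it assumes persons.py has already been shipped with sc.addPyFile as above, and rebuilds the RDD so that the class used on the driver is the same module-level persons.Person the workers import:

import persons

# Build the RDD from the class defined in the shipped module.
a = sc.parallelize([['Barack Obama', 54, 'M'], ['Joe Biden', 74, 'M']]).map(lambda row: persons.Person(*row))

def func(row):
    return row._replace(name='Jack Rabbit')

print(a.map(func).collect())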

On a side note, the leading underscore is there for a reason.


* It could be done dynamically, like this: a.map(lambda row: collections.namedtuple('Person', ['name', 'age', 'gender'])(*row)), or by defining Person inside mapPartitions, but it is neither pretty nor efficient.
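
For completeness, a rough sketch of the mapPartitions variant mentioned above; the class is re-created once per partition instead of being captured in the closure (illustration only, with the same caveat about it being neither pretty nor efficient):

import collections

def replace_names(rows):
    # Defining the class inside the function means no class has to be
    # serialized as part of the closure; it is rebuilt on each partition.
    Person = collections.namedtuple('Person', ['name', 'age', 'gender'])
    for row in rows:
        yield Person(*row)._replace(name='Jack Rabbit')

raw = sc.parallelize([['Barack Obama', 54, 'M'], ['Joe Biden', 74, 'M']])
print(raw.mapPartitions(replace_names).collect())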


1 Comment

Thanks for the answer, it solved my problem. However, I still wonder: how do worker nodes know how to serialize and deserialize data for custom types such as the namedtuple above?
