So I'm using Spark to do sentiment analysis, and I keep getting errors from the serializers it uses (I think) to pass Python objects around:
PySpark worker failed with exception:
Traceback (most recent call last):
  File "/Users/abdul/Desktop/RSI/spark-1.0.1-bin-hadoop1/python/pyspark/worker.py", line 77, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/Users/abdul/Desktop/RSI/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py", line 191, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/Users/abdul/Desktop/RSI/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py", line 123, in dump_stream
    for obj in iterator:
  File "/Users/abdul/Desktop/RSI/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py", line 180, in _batched
    for item in iterator:
TypeError: __init__() takes exactly 3 arguments (2 given)
The code for the serializers is available here, and my code is here.
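For what it's worth, the traceback shows the failure is raised while the serializer lazily iterates over my mapped output (`for item in iterator:`), so I suspect the `TypeError` actually comes from something my own code constructs during iteration, not from PySpark itself. A minimal sketch of that failure mode, using a hypothetical `Sentiment` class and `classify` function (these names are assumptions, not my actual code):

```python
# Hypothetical reconstruction of the failure mode; Sentiment and
# classify are made-up stand-ins for the real sentiment-analysis code.

class Sentiment(object):
    def __init__(self, text, score):   # expects 3 arguments: self, text, score
        self.text = text
        self.score = score

def classify(record):
    # Bug: only 2 arguments given (self + record) -- on Python 2 this is
    # reported as "__init__() takes exactly 3 arguments (2 given)".
    return Sentiment(record)

# PySpark's serializer consumes the mapped output lazily, so the error
# surfaces inside serializers.py rather than at the call site.  A plain
# generator reproduces the same deferred failure:
records = ("great movie", "terrible plot")
results = (classify(r) for r in records)

try:
    list(results)                      # forces iteration, raises TypeError
except TypeError as exc:
    print("TypeError:", exc)
```

Note that Python 2 (which Spark 1.0.1 runs on) phrases the message exactly as in the traceback; Python 3 reports a missing positional argument instead, but the cause is the same.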