5

I followed this article to send some data to AWS ES, and I used the jar elasticsearch-hadoop. Here is my script:

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
if __name__ == "__main__":
    conf = SparkConf().setAppName("WriteToES")
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)
    es_conf = {"es.nodes" : "https://search-elasticsearchdomaine.region.es.amazonaws.com/",
    "es.port" : "9200","es.nodes.client.only" : "true","es.resource" : "sensor_counts/metrics"}
    es_df_p = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("output/part-00000-c353bb29-f189-4189-b35b-f7f1af717355.csv")
    es_df_pf= es_df_p.groupBy("network_key")
    es_df_pf.saveAsNewAPIHadoopFile(
    path='-',
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_conf)

Then I run this command line:

spark-submit --jars elasticsearch-spark-20_2.11-5.3.1.jar write_to_es.py

where write_to_es.py is the script above.

Here is the error I got:

17/05/05 17:51:52 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
17/05/05 17:51:52 INFO HadoopRDD: Input split: file:/home/user/spark-2.1.0-bin-hadoop2.7/output/part-00000-c353bb29-f189-4189-b35b-f7f1af717355.csv:0+178633
17/05/05 17:51:52 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1143 bytes result sent to driver
17/05/05 17:51:52 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 11 ms on localhost (executor driver) (1/1)
17/05/05 17:51:52 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
17/05/05 17:51:52 INFO DAGScheduler: ResultStage 1 (load at NativeMethodAccessorImpl.java:0) finished in 0,011 s
17/05/05 17:51:52 INFO DAGScheduler: Job 1 finished: load at NativeMethodAccessorImpl.java:0, took 0,018727 s
17/05/05 17:51:52 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 192.168.1.26:39609 in memory (size: 2.1 KB, free: 366.3 MB)
17/05/05 17:51:52 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 192.168.1.26:39609 in memory (size: 22.9 KB, free: 366.3 MB)
17/05/05 17:51:52 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 192.168.1.26:39609 in memory (size: 2.1 KB, free: 366.3 MB)
Traceback (most recent call last):
  File "/home/user/spark-2.1.0-bin-hadoop2.7/write_to_es.py", line 11, in <module>
    es_df_pf.saveAsNewAPIHadoopFile(
  File "/home/user/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 964, in __getattr__
AttributeError: 'DataFrame' object has no attribute 'saveAsNewAPIHadoopFile'
17/05/05 17:51:53 INFO SparkContext: Invoking stop() from shutdown hook
17/05/05 17:51:53 INFO SparkUI: Stopped Spark web UI at http://192.168.1.26:4040
17/05/05 17:51:53 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/05/05 17:51:53 INFO MemoryStore: MemoryStore cleared
17/05/05 17:51:53 INFO BlockManager: BlockManager stopped
17/05/05 17:51:53 INFO BlockManagerMaster: BlockManagerMaster stopped
17/05/05 17:51:53 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/05/05 17:51:53 INFO SparkContext: Successfully stopped SparkContext
17/05/05 17:51:53 INFO ShutdownHookManager: Shutdown hook called
17/05/05 17:51:53 INFO ShutdownHookManager: Deleting directory /tmp/spark-501c4efa-5402-430e-93c1-aaff4caddef0
17/05/05 17:51:53 INFO ShutdownHookManager: Deleting directory /tmp/spark-501c4efa-5402-430e-93c1-aaff4caddef0/pyspark-52406fa8-e8d1-4aca-bcb6-91748dc87507

How to solve this:

 AttributeError: 'DataFrame' object has no attribute 'saveAsNewAPIHadoopFile'

Any help or suggestion is very appreciated.

0

2 Answers 2

4

I had the same problem.

After reading this article, I found the answer!!!

You have to convert to PythonRDD Type like this:

>>> type(df)
<class 'pyspark.sql.dataframe.DataFrame'>

>>> type(df.rdd)
<class 'pyspark.rdd.RDD'>

>>> df.rdd.saveAsNewAPIHadoopFile(...) # Got the same error message

>>> df.printSchema() # My schema
root
 |-- id: string (nullable = true)
 ...

# Let's convert to PythonRDD
>>> python_rdd = df.map(lambda item: ('key', {
... 'id': item['id'],
    ...
... }))

>>> python_rdd
PythonRDD[42] at RDD at PythonRDD.scala:43

>>> python_rdd.saveAsNewAPIHadoopFile(...) # Now, success
Sign up to request clarification or add additional context in comments.

1 Comment

What's the difference between PythonRDD and RDD? Why RDD raises the error but not PythonRDD?
0

saveAsNewAPIHadoopFile is in RDD ,

http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD

I guess this line should be

es_df_pf.rdd.saveAsNewAPIHadoopFile

1 Comment

When I tried it, it gave me a huge error : 17/05/09 09:40:21 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2) net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.