Python spark Dataframe to Elasticsearch

Question

I can't figure out how to write a DataFrame to Elasticsearch from Spark using Python. I followed the steps from here.

Here is my code:

# Read file
df = sqlContext.read \
    .format('com.databricks.spark.csv') \
    .options(header='true') \
    .load('/vagrant/data/input/input.csv', schema = customSchema)

df.registerTempTable("data")

# KPIs
kpi1 = sqlContext.sql("SELECT * FROM data")

es_conf = {"es.nodes" : "10.10.10.10","es.port" : "9200","es.resource" : "kpi"}
kpi1.rdd.saveAsNewAPIHadoopFile(
    path='-',
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_conf)

The code above fails with:

Caused by: net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)

I also launched the script with the following command, to ensure that elasticsearch-hadoop is loaded:

spark-submit --master spark://aggregator:7077 --jars ../jars/elasticsearch-hadoop-2.4.0/dist/elasticsearch-hadoop-2.4.0.jar /vagrant/scripts/aggregation.py

@eliasah 2.4.0. I also tried elasticsearch-hadoop-5.0.0-alpha5.jar for the 2.x versions of ES.
– dimzak, Commented Sep 20, 2016 at 13:19

Community · Accepted Answer · 2017-05-23 12:00:56Z

4

For starters, saveAsNewAPIHadoopFile expects an RDD of (key, value) pairs, and in your case this may happen only accidentally. The same applies to the value format you declare.

I am not familiar with Elastic, but based on the arguments alone you should probably try something similar to this:

kpi1.rdd.map(lambda row: (None, row.asDict())).saveAsNewAPIHadoopFile(...)
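
To make the shape of those pairs concrete, here is a minimal sketch of the conversion. `row_to_es_pair` and `FakeRow` are hypothetical names used only for illustration; `FakeRow` stands in for `pyspark.sql.Row` so the snippet runs without a Spark installation. The assumption that the key is ignored on write follows from the `NullWritable` key class used in the question.

```python
def row_to_es_pair(row):
    """Turn a Row-like object into a (key, value) pair.

    The key is None (assumed to be ignored on write, matching the
    NullWritable key class); the value is a plain dict of the row's fields.
    """
    return (None, row.asDict())


class FakeRow:
    """Hypothetical stand-in for pyspark.sql.Row, for illustration only."""

    def __init__(self, **fields):
        self._fields = dict(fields)

    def asDict(self):
        # Return a copy so callers cannot mutate the row's own state.
        return dict(self._fields)


pair = row_to_es_pair(FakeRow(id=1, name="a"))
print(pair)

# With Spark (assumed usage, not executed here):
# kpi1.rdd.map(row_to_es_pair).saveAsNewAPIHadoopFile(
#     path='-',
#     outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
#     keyClass="org.apache.hadoop.io.NullWritable",
#     valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
#     conf=es_conf)
```

If the rows contain nested Row objects, `row.asDict(recursive=True)` may be needed so that the nested values also become dicts.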

Since Elasticsearch-Hadoop provides a Spark SQL data source, you should also be able to skip that and save the data directly:

df.write.format("org.elasticsearch.spark.sql").save(...)

edited May 23, 2017 at 12:00


answered Sep 18, 2016 at 20:41

zero323


M.Sanchez · Accepted Answer · 2017-10-06 10:30:25Z

1

As zero323 said, the easiest way to write a DataFrame from PySpark to Elasticsearch is:

df.write.format("org.elasticsearch.spark.sql").save("index/type")

answered Oct 6, 2017 at 10:30

M.Sanchez


Suraj Rao · Accepted Answer · 2020-11-17 15:56:27Z

0

You can use something like this:

df.write \
    .mode('overwrite') \
    .format("org.elasticsearch.spark.sql") \
    .option("es.resource", '%s/%s' % (conf['index'], conf['doc_type'])) \
    .option("es.nodes", conf['host']) \
    .option("es.port", conf['port']) \
    .save()
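
The connector settings can also be collected in one place. The sketch below assumes a `conf` dict with `host`, `port`, `index` and an optional `doc_type` key, as in the snippet above; `build_es_options` and `write_to_es` are hypothetical helper names, not part of elasticsearch-hadoop.

```python
def build_es_options(conf):
    """Build the elasticsearch-hadoop option dict from a plain config dict.

    Assumes keys 'host', 'port', 'index' and, optionally, 'doc_type'.
    Host and port are converted to strings.
    """
    resource = conf["index"]
    if conf.get("doc_type"):
        resource = "%s/%s" % (conf["index"], conf["doc_type"])
    return {
        "es.nodes": str(conf["host"]),
        "es.port": str(conf["port"]),
        "es.resource": resource,
    }


def write_to_es(df, conf, mode="append"):
    """Write a Spark DataFrame through the ES Spark SQL data source."""
    (df.write
        .mode(mode)
        .format("org.elasticsearch.spark.sql")
        .options(**build_es_options(conf))
        .save())


opts = build_es_options({"host": "10.10.10.10", "port": 9200, "index": "kpi"})
print(opts)
```

Calling `write_to_es(kpi1, conf)` needs the elasticsearch-hadoop jar on the classpath, for example via `--jars` as in the question. Whether `index/type` or a bare index name is appropriate depends on the Elasticsearch version, since newer versions no longer use mapping types.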

edited Nov 17, 2020 at 15:56

Suraj Rao


answered Nov 17, 2020 at 15:44

demonodojo
