
This is the terminal command I use to run my strm.py file:

$SPARK_HOME/bin/spark-submit --master local --driver-memory 4g --num-executors 2 --executor-memory 4g --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0 org.apache.spark:spark-cassandra-connector_2.11:2.4.0 strm.py

Error:

Cannot load main class from JAR org.apache.spark:spark-cassandra-connector_2.11:2.4.0 with URI org.apache.spark. Please specify a class through --class.
    at org.apache.spark.deploy.SparkSubmitArguments.error(SparkSubmitArguments.scala:657)
    at org.apache.spark.deploy.SparkSubmitArguments.loadEnvironmentArguments(SparkSubmitArguments.scala:224)
    at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:116)
    at org.apache.spark.deploy.SparkSubmit$$anon$2$$anon$1.<init>(SparkSubmit.scala:907)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.parseArguments(SparkSubmit.scala:907)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:81)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Can anyone help me figure out what the issue is and why it can't load?

  • This is the code I wrote for storing streaming data to a Cassandra table:

    query1 = query.writeStream\
        .option("checkpointLocation", '/tmp/check_point/')\
        .format("org.apache.spark.sql.cassandra")\
        .option("keyspace", "test")\
        .option("table", "my_tables")\
        .start()\
        .awaitTermination()

    Commented Mar 4, 2020 at 9:03

1 Answer


You have 2 problems:

  • you're submitting your application incorrectly - there is no comma between org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0 and org.apache.spark:spark-cassandra-connector_2.11:2.4.0, so spark-submit treats the Cassandra connector coordinate as the application JAR instead of your Python file (a corrected command is shown after the code sketch below).

  • the current version of the Spark Cassandra Connector doesn't support direct writes of Spark Structured Streaming data - this functionality is available only in DSE Analytics. But you can work around this by using foreachBatch, something like this (not tested; the working Scala code is available here):

def foreach_batch_function(df, epoch_id):
    # write each micro-batch out to Cassandra via the DataFrame writer
    df.write.format("org.apache.spark.sql.cassandra")\
        .option("keyspace", "test")\
        .option("table", "my_tables")\
        .mode('append')\
        .save()

query.writeStream.foreachBatch(foreach_batch_function).start()
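
For the first problem, the corrected submit command just adds the missing comma between the two coordinates (note also that the Cassandra connector is published under the com.datastax.spark groupId, not org.apache.spark):

$SPARK_HOME/bin/spark-submit --master local --driver-memory 4g --num-executors 2 --executor-memory 4g --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0,com.datastax.spark:spark-cassandra-connector_2.11:2.4.0 strm.py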

6 Comments

After writing the above function I am getting an error: pyspark.sql.utils.AnalysisException: 'Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;'
Also, what is epoch_id, and what is it used for in def foreach_batch_function(df, epoch_id)?
Regarding mode - adjust it to your needs... As the error says, you may need to set a watermark, but that depends on your logic (see the sketch below for an example). The epoch ID is a number that can be used for tracking; you can ignore it for now, but it has to be in the function signature.
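For illustration, a minimal sketch of setting a watermark before a windowed aggregation; the column name ts and the time thresholds here are assumptions, not from the original question:

from pyspark.sql import functions as F

# Assumed: df is a streaming DataFrame with an event-time column `ts`.
# The watermark bounds how late data may arrive, which lets Spark finalize
# aggregation windows and makes append output mode valid.
windowed = df.withWatermark("ts", "10 minutes")\
    .groupBy(F.window("ts", "5 minutes"))\
    .count()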
I tried this query to store streaming data to a Cassandra table:

def writeToCassandra(writeDF, epochId):
    writeDF.write\
        .format("org.apache.spark.sql.cassandra")\
        .options(table="my_tables", keyspace="test")\
        .mode("append")\
        .save()

query = df.writeStream\
    .trigger(processingTime="10 seconds")\
    .outputMode("update")\
    .foreachBatch(writeToCassandra)\
    .start()\
    .awaitTermination()

But I got an error: Failed to find data source: org.apache.spark.sql.cassandra. Please find packages at spark.apache.org/third-party-projects.html
It looks like that package isn't loaded... Before running the streaming job, just check whether you can access Cassandra from pyspark at all: github.com/datastax/spark-cassandra-connector/blob/master/doc/…
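A minimal connectivity check might look like this, assuming pyspark was started with the connector package and spark.cassandra.connection.host is configured (the keyspace/table names are the ones from the question):

# If the connector is on the classpath, this batch read should succeed
# (or fail with a Cassandra error, not "Failed to find data source").
df = spark.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table="my_tables", keyspace="test")\
    .load()
df.show()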
