
I have a Spark dataframe which I would like to write to Kafka. I have tried the snippet below:

from kafka import KafkaProducer
producer = KafkaProducer(bootstrap_servers = util.get_broker_metadata())
df = sqlContext.createDataFrame([("foo", 1), ("bar", 2), ("baz", 3)], ('k', 'v'))
for row in df.rdd.collect():
    producer.send('topic',str(row.asDict()))
    producer.flush()

This works, but the problem with this snippet is that it is not scalable: every time collect runs, the data is aggregated on the driver node, which can slow down all operations.

Since a foreach operation on a dataframe can run in parallel on the worker nodes, I tried the approach below.

from kafka import KafkaProducer
producer = KafkaProducer(bootstrap_servers = util.get_broker_metadata())
df = sqlContext.createDataFrame([("foo", 1), ("bar", 2), ("baz", 3)], ('k', 'v'))
def custom_fun(row):
    producer.send('topic',str(row.asDict()))
    producer.flush()

df.foreach(custom_fun)

This doesn't work and gives a pickling error: PicklingError: Cannot pickle objects of type <type 'itertools.count'>. I am not able to understand the reason behind this error. Can anyone help me understand this error or provide any other parallel solution?

  • What is the Spark version and Python version? Do you get the same error when you run this code with a clean session? Commented Jan 16, 2018 at 13:36
  • Hi, the Spark version is 2.1 and Python is 2.7. Not sure what you mean by a clean session, but I get the same error every time I launch the job on YARN using spark-submit. Commented Jan 16, 2018 at 13:42
  • I mean that the error looks unrelated to Kafka writes. Commented Jan 16, 2018 at 13:50
  • @NachiketKate: were you able to find the answer? I am facing the same issue and am not able to write to a Confluent Kafka topic. Commented May 3, 2021 at 8:38

1 Answer


The error you get looks unrelated to Kafka writes. It looks like somewhere else in your code you use itertools.count (AFAIK it is not used in Spark's source at all, though it may of course come in with KafkaProducer), and for some reason it is being serialized with the cloudpickle module, so changing the Kafka writing code might have no impact at all. If KafkaProducer is the source of the error, you should be able to resolve this with foreachPartition:

from kafka import KafkaProducer


def send_to_kafka(rows):
    # create the producer on the executor, once per partition, instead of on the driver
    producer = KafkaProducer(bootstrap_servers=util.get_broker_metadata())
    for row in rows:
        producer.send('topic', str(row.asDict()))
        producer.flush()

df.foreachPartition(send_to_kafka)
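
Since the function above flushes after every record, one optional refinement (my own variant, not part of the original answer) is to cache a single producer per executor process and flush once per partition; a minimal sketch, assuming the same util.get_broker_metadata() helper:

from kafka import KafkaProducer

_producer = None  # cached per executor process (hypothetical helper, not in the original)

def _get_producer():
    global _producer
    if _producer is None:
        _producer = KafkaProducer(bootstrap_servers=util.get_broker_metadata())
    return _producer

def send_to_kafka(rows):
    producer = _get_producer()
    for row in rows:
        producer.send('topic', str(row.asDict()))
    producer.flush()  # flush once after the whole partition

df.foreachPartition(send_to_kafka)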

That being said:

or provide any other parallel solution?

I would recommend using the Kafka source instead. Include the Kafka SQL package, for example:

spark.jars.packages  org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0
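
If you build the session yourself, the same package can be set on the builder before the session starts; a minimal sketch (the app name is just a placeholder, and the artifact's Scala/Spark version must match your cluster):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("kafka-writer")
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0")
    .getOrCreate())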

And:

from pyspark.sql.functions import to_json, col, struct

(df
    .select(to_json(struct([col(c).alias(c) for c in df.columns])).alias("value"))
    .write
    .format("kafka")
    .option("kafka.bootstrap.servers", bootstrap_servers)
    .option("topic", topic)
    .save())
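
To sanity-check the write, you can read the topic back in batch mode (supported with the same package from Spark 2.2 on); a minimal sketch, assuming a SparkSession named spark and the same bootstrap_servers and topic variables:

read_back = (spark.read
    .format("kafka")
    .option("kafka.bootstrap.servers", bootstrap_servers)
    .option("subscribe", topic)
    .load())

# Kafka returns key/value as binary; cast value to string to inspect the JSON payloads
read_back.selectExpr("CAST(value AS STRING)").show(truncate=False)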

4 Comments

Thanks for the answer. I'll try this and will let you know.
With dataframe.write() I get a NoSuchMethodError. Looks like a version mismatch between Spark, Kafka, and spark-sql-kafka.
The spark-sql-kafka component has to match the Spark and Scala versions.
How about sending just one column of the dataframe to Kafka instead of the entire record?
