
I'm working on an implementation of Spark Streaming in Scala where I'm pulling JSON strings from a Kafka topic and want to load them into a DataFrame. Is there a way to do this where Spark infers the schema on its own from an RDD[String]?

4 Answers


Yes, you can use the following:

sqlContext.read
  //.schema(schema) // optional, makes it a bit faster; if you've processed the data before, you can get the schema using df.schema
  .json(jsonRDD)    // jsonRDD is the RDD[String]
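A sketch of the schema-reuse pattern the commented-out line hints at: infer the schema once, cache it, and pass it in for later batches to skip the inference pass (jsonRDD2 stands for a hypothetical later batch):

  val df = sqlContext.read.json(jsonRDD)           // schema inferred from the data here
  val cachedSchema = df.schema                     // keep it around for later batches
  val df2 = sqlContext.read.schema(cachedSchema).json(jsonRDD2)  // no inference pass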

I'm trying to do the same at the moment. I'm curious how you got the RDD[String] out of Kafka, though; I'm still under the impression that Spark+Kafka only does streaming rather than one-off "take out what's in there right now" batches. :)


1 Comment

You can use KafkaUtils.createRDD to get a non-streaming RDD from Kafka
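A minimal sketch of that approach, assuming the spark-streaming-kafka artifact for Spark 1.x and a hypothetical broker address, topic, and offset range:

  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

  // hypothetical broker, topic, and offsets; adjust to your setup
  val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
  val offsetRanges = Array(OffsetRange("my-topic", 0, 0L, 100L))

  // a one-off batch RDD of (key, value) pairs; no StreamingContext needed
  val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
    sc, kafkaParams, offsetRanges)

  // the values are the JSON strings, so schema inference works as above
  val df = sqlContext.read.json(rdd.map(_._2))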

In Spark 1.4, you could try the following method to generate a DataFrame from an RDD:

  val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
  val yourDataFrame = hiveContext.createDataFrame(yourRDD)
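Note that createDataFrame derives the schema by reflection from the RDD's element type, so it expects case-class instances rather than raw JSON strings. A minimal sketch with a hypothetical Message case class:

  case class Message(id: Long, body: String)

  val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

  // schema (id: bigint, body: string) comes from the case class fields
  val yourRDD = sc.parallelize(Seq(Message(1L, "hello"), Message(2L, "world")))
  val yourDataFrame = hiveContext.createDataFrame(yourRDD)
  yourDataFrame.printSchema()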

1 Comment

This is similar to the following question: stackoverflow.com/questions/29383578/…

You can use the code below to read the stream of messages from Kafka, extract the JSON values, and convert them to a DataFrame:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topicsSet)

messages.foreachRDD { rdd =>
  // keep only the message values (the JSON strings), dropping the keys
  val df = sqlContext.read.json(rdd.map(x => x._2))
  df.show()
}
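One caveat: rather than closing over a driver-side sqlContext, the Spark Streaming docs suggest fetching a singleton inside foreachRDD so the job survives driver recovery. A sketch of that variant (SQLContext.getOrCreate exists from Spark 1.5 onward):

  import org.apache.spark.sql.SQLContext

  messages.foreachRDD { rdd =>
    // lazily get (or create) a singleton SQLContext for this SparkContext
    val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
    val df = sqlContext.read.json(rdd.map(_._2))
    df.show()
  }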



There is no schema inference on streaming. You can always read a sample file and pull the schema from it; you could also commit that file to version control and put it in an S3 bucket.
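A minimal sketch of that idea, assuming a hypothetical sample.json with representative records and the messages stream from the answer above:

  // infer the schema once from the committed sample file...
  val schema = sqlContext.read.json("sample.json").schema

  // ...then apply it to each micro-batch, skipping inference entirely
  messages.foreachRDD { rdd =>
    val df = sqlContext.read.schema(schema).json(rdd.map(_._2))
    df.show()
  }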
