
I'm working on an implementation of Spark Streaming in Scala where I'm pulling JSON strings from a Kafka topic and want to load them into a DataFrame. Is there a way to do this where Spark infers the schema on its own from an RDD[String]?

4 Answers


Yes, you can use the following:

sqlContext.read
  //.schema(schema) // optional, makes it a bit faster; if you've processed the data before, you can get the schema using df.schema
  .json(jsonRDD)    // jsonRDD is the RDD[String]
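A sketch of the schema-reuse pattern the commented-out line hints at: infer the schema once, cache it, and pass it in for later batches to skip the inference pass (jsonRDD2 stands for a hypothetical later batch):

  val df = sqlContext.read.json(jsonRDD)           // schema inferred from the data here
  val cachedSchema = df.schema                     // keep it around for later batches
  val df2 = sqlContext.read.schema(cachedSchema).json(jsonRDD2)  // no inference pass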

I'm trying to do the same at the moment. I'm curious how you got the RDD[String] out of Kafka, though; I'm still under the impression that Spark+Kafka only does streaming rather than one-off "take out what's in there right now" batches. :)


1 Comment

You can use KafkaUtils.createRDD to get a non-streaming RDD from Kafka
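A minimal sketch of that approach, assuming the spark-streaming-kafka artifact for Spark 1.x and a hypothetical broker address, topic, and offset range:

  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

  // hypothetical broker, topic, and offsets; adjust to your setup
  val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
  val offsetRanges = Array(OffsetRange("my-topic", 0, 0L, 100L))

  // a one-off batch RDD of (key, value) pairs; no StreamingContext needed
  val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
    sc, kafkaParams, offsetRanges)

  // the values are the JSON strings, so schema inference works as above
  val df = sqlContext.read.json(rdd.map(_._2))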

In Spark 1.4, you could try the following method to generate a DataFrame from an RDD:

  val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
  val yourDataFrame = hiveContext.createDataFrame(yourRDD)
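Note that createDataFrame derives the schema by reflection from the RDD's element type, so it expects case-class instances rather than raw JSON strings. A minimal sketch with a hypothetical Message case class:

  case class Message(id: Long, body: String)

  val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

  // schema (id: bigint, body: string) comes from the case class fields
  val yourRDD = sc.parallelize(Seq(Message(1L, "hello"), Message(2L, "world")))
  val yourDataFrame = hiveContext.createDataFrame(yourRDD)
  yourDataFrame.printSchema()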

1 Comment

This is similar to the following question: stackoverflow.com/questions/29383578/…

You can use the code below to read the stream of messages from Kafka, extract the JSON values, and convert them to a DataFrame:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topicsSet)

messages.foreachRDD { rdd =>
  // keep only the message values (the JSON strings), dropping the keys
  val df = sqlContext.read.json(rdd.map(x => x._2))
  df.show()
}
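One caveat: rather than closing over a driver-side sqlContext, the Spark Streaming docs suggest fetching a singleton inside foreachRDD so the job survives driver recovery. A sketch of that variant (SQLContext.getOrCreate exists from Spark 1.5 onward):

  import org.apache.spark.sql.SQLContext

  messages.foreachRDD { rdd =>
    // lazily get (or create) a singleton SQLContext for this SparkContext
    val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
    val df = sqlContext.read.json(rdd.map(_._2))
    df.show()
  }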



There is no schema inference on streaming. You can always read a sample file and pull the schema from it; you could also commit that file to version control and put it in an S3 bucket.
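A minimal sketch of that idea, assuming a hypothetical sample.json with representative records and the messages stream from the answer above:

  // infer the schema once from the committed sample file...
  val schema = sqlContext.read.json("sample.json").schema

  // ...then apply it to each micro-batch, skipping inference entirely
  messages.foreachRDD { rdd =>
    val df = sqlContext.read.schema(schema).json(rdd.map(_._2))
    df.show()
  }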
