I want to convert incoming JSON data from Kafka into a DataFrame.

I am using Structured Streaming with Scala 2.12.

Most people add a hard-coded schema, but if the JSON can have additional fields, that requires changing the code base every time, which is tedious.

One approach is to write the data into a file and infer the schema from it, but I'd rather avoid doing that.

Is there any other way to approach this problem?

Edit: I found a way to turn a JSON string into a DataFrame, but I can't extract the string from the stream source. Is it possible to extract it?

  • EDIT: I successfully made a schema object via the Schema Registry API; it returned a Schema object. Is it possible to convert it to a StructType object? Commented Jun 27, 2020 at 16:31
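(For the conversion asked about in the comment above: a minimal sketch, assuming spark-avro is on the classpath and that the registry call returned the Avro schema as a JSON string; the schemaString value here is hypothetical.)

```scala
import org.apache.avro.Schema
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types.StructType

// Hypothetical: the Avro schema JSON fetched from the Schema Registry.
val schemaString =
  """{"type":"record","name":"Event","fields":[{"name":"id","type":"string"}]}"""

// Parse the Avro schema and convert it to Spark's StructType.
val avroSchema = new Schema.Parser().parse(schemaString)
val structType = SchemaConverters.toSqlType(avroSchema)
  .dataType.asInstanceOf[StructType]
```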

3 Answers

  1. One way is to store the schema itself in the message headers (not in the key or value).

    Though this increases the message size, it makes it easy to parse the JSON value without needing any external resource such as a file or a schema registry (a sketch follows this list).

    New messages can have new schemas while old messages can still be processed using their old schemas, because each schema travels within its own message.

  2. Alternatively, you can version the schemas and include an ID for every schema in the message headers, or a magic byte in the key or value, and infer the schema from there.

    This is the approach followed by the Confluent Schema Registry. It lets you go through different versions of the same schema and see how your schema has evolved over time.
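A minimal sketch of the first approach, assuming Spark 3.0+ (needed for Kafka header support), a hypothetical schema header carrying a JSON-serialized StructType, and hypothetical broker/topic/sink names. Since from_json needs one literal schema, the sketch processes each micro-batch with the schema taken from its first message:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DataType, StructType}

val spark = SparkSession.builder.appName("schema-in-headers").getOrCreate()

// Hypothetical broker and topic names.
val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .option("includeHeaders", "true") // exposes a headers column (Spark 3.0+)
  .load()

// A typed val avoids the foreachBatch overload ambiguity on Scala 2.12.
val parseBatch: (DataFrame, Long) => Unit = (batch, _) =>
  if (!batch.isEmpty) {
    // Read the "schema" header (assumed to hold a JSON-serialized
    // StructType) from the first message of the micro-batch.
    val headerBytes = batch
      .select(expr("filter(headers, h -> h.key = 'schema')[0].value").as("s"))
      .head().getAs[Array[Byte]]("s")
    val schema = DataType.fromJson(new String(headerBytes, "UTF-8"))
      .asInstanceOf[StructType]

    batch
      .select(from_json(col("value").cast("string"), schema).as("data"))
      .select("data.*")
      .write.format("parquet").mode("append").save("/tmp/out") // hypothetical sink
  }

val query = raw.writeStream.foreachBatch(parseBatch).start()
```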

2 Comments

I have a schema registry that stores the Avro schema of the data. Wouldn't there be serialization issues for (JsonString, AvroSchema) => DataFrame?
@coding_potato You may want to write a custom deserializer that detects the type of the data (Avro or JSON) and deserializes it accordingly. The Confluent Avro serializer typically puts a magic byte in front of Avro-encoded data; if you find that magic byte, deserialize the message as Avro, and if you get an exception while deserializing, try deserializing it as JSON instead.

Read the data as a string and then convert it to a Map[String, String]; this way you can process any JSON without even knowing its schema.

2 Comments

The solution seems fine; however, if the data contains other types I get an encoding error: java.lang.ClassCastException: java.lang.Double cannot be cast to java.lang.String. Is there a workaround, or could you provide a basic code example?
When you convert the JSON string to Map[String, String], there won't be any error; it's when you access a particular value (like a Double) that you face such an issue. A workaround would be using String.valueOf(); see the sketch below.
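Putting the answer and the workaround from the comments together, a minimal sketch, assuming Jackson's Scala module is on the classpath; broker and topic names are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

val spark = SparkSession.builder.appName("json-as-map").getOrCreate()
import spark.implicits._

// Hypothetical broker and topic names.
val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS json")
  .as[String]

// Parse each JSON string into a schemaless map. Values are decoded as
// Any and stringified with String.valueOf, which avoids the
// ClassCastException on non-string values such as Double. The mapper is
// created inside the closure because ObjectMapper is not serializable;
// use mapPartitions to amortize the cost in real code.
val asMap = raw.map { json =>
  val mapper = new ObjectMapper().registerModule(DefaultScalaModule)
  mapper
    .readValue(json, classOf[Map[String, Any]])
    .map { case (k, v) => k -> String.valueOf(v) }
}
```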

Based on JavaTechnical's answer, the best approach would be to use a schema registry and Avro data instead of JSON; there is no getting around hardcoding a schema (for now).

Include your schema name and ID as headers and use them to read the schema from the schema registry.

Use the from_avro function to turn that data into a DataFrame!
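A minimal sketch of this approach, assuming Spark 3.x (for org.apache.spark.sql.avro.functions.from_avro), Confluent's CachedSchemaRegistryClient, and hypothetical registry/topic names. It looks up the latest schema by subject; resolving by an ID taken from the headers works the same way. Note that Confluent's Avro serializer prefixes each payload with a magic byte plus a 4-byte schema ID, which from_avro does not strip, so the sketch drops those 5 bytes first:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.avro.functions.from_avro
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient

val spark = SparkSession.builder.appName("avro-from-registry").getOrCreate()
import spark.implicits._

// Hypothetical registry URL and subject name.
val registry = new CachedSchemaRegistryClient("http://localhost:8081", 100)
val schemaJson = registry.getLatestSchemaMetadata("events-value").getSchema

val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()

val df = raw
  // Drop the 5-byte Confluent prefix (magic byte + schema ID)
  // before handing the Avro payload to from_avro.
  .select(expr("substring(value, 6, length(value) - 5)").as("payload"))
  .select(from_avro($"payload", schemaJson).as("data"))
  .select("data.*")
```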
