
I've been googling this all day and couldn't find a straight answer, so I ended up posting a question here.

I have a file containing line-delimited JSON objects:

{"device_id": "103b", "timestamp": 1436941050, "rooms": ["Office", "Foyer"]}
{"device_id": "103b", "timestamp": 1435677490, "rooms": ["Office", "Lab"]}
{"device_id": "103b", "timestamp": 1436673850, "rooms": ["Office", "Foyer"]}

My goal is to parse this file with Apache Spark in Java. I referenced How to Parsing CSV or JSON File with Apache Spark, and so far I have been able to parse each line of JSON into a JavaRDD<JsonObject> using Gson.

JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> data = sc.textFile("fileName");
JavaRDD<JsonObject> records = data.map(new Function<String, JsonObject>() {
    public JsonObject call(String line) throws Exception {
        Gson gson = new Gson();
        JsonObject json = gson.fromJson(line, JsonObject.class);
        return json;
    }
});

Where I'm really stuck is deserializing the "rooms" array so that each element fits into my class Event.

public class Event implements Serializable {
    public static final long serialVersionUID = 42L;
    private String deviceId;
    private int timestamp;
    private String room;
    // constructor, getters and setters
}

In other words, from this line:

{"device_id": "103b", "timestamp": 1436941050, "rooms": ["Office", "Foyer"]}

I want to create two Event objects in Spark:

obj1: deviceId = "103b", timestamp = 1436941050, room = "Office"
obj2: deviceId = "103b", timestamp = 1436941050, room = "Foyer"

I did a little searching and tried flatMapValue, but no luck; it threw me an error...

JavaRDD<Event> events = records.flatMapValue(new Function<JsonObject, Iterable<Event>>() {
    public Iterable<Event> call(JsonObject json) throws Exception {
        JsonArray rooms = json.get("rooms").getAsJsonArray();
        List<Event> data = new LinkedList<Event>();
        for (JsonElement room : rooms) {
            data.add(new Event(json.get("device_id").getAsString(), json.get("timestamp").getAsInt(), room.toString()));
        }
        return data;
    }
});
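
I suspect the error is because flatMapValue / flatMapValues only exists on JavaPairRDD, not on JavaRDD, so maybe what I actually need is plain flatMap. A rough, untested sketch of that (using the Spark 1.x FlatMapFunction signature, which returns an Iterable; in Spark 2.x it returns an Iterator):

JavaRDD<Event> events = records.flatMap(new FlatMapFunction<JsonObject, Event>() {
    public Iterable<Event> call(JsonObject json) throws Exception {
        // one Event per entry in the "rooms" array
        List<Event> result = new LinkedList<Event>();
        for (JsonElement room : json.get("rooms").getAsJsonArray()) {
            // getAsString() gives the bare value; toString() would keep the quotes
            result.add(new Event(json.get("device_id").getAsString(),
                                 json.get("timestamp").getAsInt(),
                                 room.getAsString()));
        }
        return result;
    }
});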

I'm very new to Spark and Map/Reduce. I would be grateful if you can help me out. Thanks in advance!

  • Please post your error. Edit your post and add the stack trace. Commented Jul 13, 2016 at 7:53

2 Answers


If you load the JSON data into a DataFrame:

DataFrame df = sqlContext.read().json("/path/to/json");

then you can easily do this with explode:

df.select(
    df.col("device_id"),
    df.col("timestamp"),
    org.apache.spark.sql.functions.explode(df.col("rooms")).as("room")
);

For input:

{"device_id": "1", "timestamp": 1436941050, "rooms": ["Office", "Foyer"]}
{"device_id": "2", "timestamp": 1435677490, "rooms": ["Office", "Lab"]}
{"device_id": "3", "timestamp": 1436673850, "rooms": ["Office", "Foyer"]}

You will get:

+---------+------+----------+
|device_id|  room| timestamp|
+---------+------+----------+
|        1|Office|1436941050|
|        1| Foyer|1436941050|
|        2|Office|1435677490|
|        2|   Lab|1435677490|
|        3|Office|1436673850|
|        3| Foyer|1436673850|
+---------+------+----------+
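
If you ultimately need the Event objects from the question rather than a DataFrame, a minimal sketch (Spark 1.x API, assuming the Event class and constructor from the question) is to map the exploded rows back into an RDD:

DataFrame exploded = df.select(
    df.col("device_id"),
    df.col("timestamp"),
    org.apache.spark.sql.functions.explode(df.col("rooms")).as("room"));

JavaRDD<Event> events = exploded.javaRDD().map(new Function<Row, Event>() {
    public Event call(Row row) throws Exception {
        // the JSON source infers whole numbers as long, so cast down to int
        return new Event(row.getString(0), (int) row.getLong(1), row.getString(2));
    }
});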

2 Comments

Thanks for letting me know about this useful feature. I didn't know Spark supported Hive-like UDFs. It's very helpful!
Spark is fully compatible with Hive (*≧▽≦)
val formatrecord = records.map(fromJson[mapClass](_))

mapClass should be a case class that maps the structure of the JSON objects in records.
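
In Java (the asker's language) the analogous approach is a small mapping class plus Gson; a sketch only, where the class name Reading is illustrative and its fields simply mirror the JSON keys:

// Illustrative mapping class (top-level or static nested so Gson can instantiate it);
// its fields match the JSON keys, like the case class described above.
public class Reading implements Serializable {
    public String device_id;
    public long timestamp;
    public List<String> rooms;
}

JavaRDD<Reading> formatted = data.map(new Function<String, Reading>() {
    public Reading call(String line) throws Exception {
        return new Gson().fromJson(line, Reading.class);
    }
});

From a JavaRDD<Reading> you can then flatten the rooms list into individual Event objects with a flatMap, as sketched in the question above.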

