
This is my current schema:

 |-- _id: string (nullable = true)
 |-- person: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- adr1: struct (nullable = true)
 |    |    |    |-- resid: string (nullable = true)

And this is what I want to obtain:

 |-- _id: string (nullable = true)
 |-- person: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- resid: string (nullable = true)

I am using the Java API.

  • flatMap on RDD or Dataset :) (see the sketch below)
  • @T.Gawęda When I saw it I thought the same, but it took me a few hours to get to a solution (first the test data, then the solution itself). It's a nice exercise; I'd suggest you give it a try.
  • @JacekLaskowski It's a nice exercise, but I can't spend too much time at work writing answers ;) So I just wanted to give a hint that may help the author :)

2 Answers


You can use a map transformation:

import java.util.stream.Collectors;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;

Encoder<PeopleFlatten> peopleFlattenEncoder = Encoders.bean(PeopleFlatten.class);

// `people` is assumed to be a Dataset<People> matching the original schema;
// the cast to MapFunction disambiguates the Java and Scala map overloads.
Dataset<PeopleFlatten> flattened = people
  .map((MapFunction<People, PeopleFlatten>) person -> new PeopleFlatten(
      person.get_id(),
      person.getPerson().stream().map(p ->
        new PersonFlatten(
          p.getName(),
          p.getAdr1().getResid()
        )
      ).collect(Collectors.toList())
    ),
    peopleFlattenEncoder
  );

where PeopleFlatten and PersonFlatten are POJOs corresponding to the expected schema from the question.

import java.io.Serializable;
import java.util.List;

public class PeopleFlatten implements Serializable {
   private String _id;
   private List<PersonFlatten> person;
   // getters and setters
}

public class PersonFlatten implements Serializable {
   private String name;
   private String resid;
   // getters and setters
}
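
The input side is not shown in the answer. A minimal sketch of what the source beans could look like, assuming people was read as a Dataset<People> via Encoders.bean(People.class) and mirrors the original schema (the class names People, Person and Adr1 are assumptions, not part of the answer):

import java.io.Serializable;
import java.util.List;

// Hypothetical beans mirroring the original (nested) schema.
public class People implements Serializable {
   private String _id;
   private List<Person> person;
   // getters and setters
}

public class Person implements Serializable {
   private String name;
   private Adr1 adr1;
   // getters and setters
}

public class Adr1 implements Serializable {
   private String resid;
   // getters and setters
}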



If it were Scala, I'd do the following, but since the OP asked about Java, I'm offering it as guidance only.

Solution 1 - Memory-Heavy

case class Address(resid: String)
case class Person(name: String, adr1: Address)

val people = Seq(
  ("one", Array(Person("hello", Address("1")), Person("world", Address("2"))))
).toDF("_id", "persons")

people.as[(String, Array[Person])].map { case (_id, arr) =>
  (_id, arr.map { case Person(name, Address(resid)) => (name, resid) })
}

This approach, however, is quite memory-expensive, because the internal binary rows are copied into JVM objects, which can drive the environment into OutOfMemoryErrors.

Solution 2 - Expensive but Language-Independent

The other query has worse performance (but a smaller memory footprint): it uses the explode operator to destructure the array first, which gives easy access to the inner structs.

import org.apache.spark.sql.functions._

val solution = people.
  select($"_id", explode($"persons") as "exploded"). // <-- that's expensive
  select("_id", "exploded.*"). // <-- this is the trick to access struct's fields
  select($"_id", $"name", $"adr1.resid").
  select($"_id", struct("name", "resid") as "person").
  groupBy("_id"). // <-- that's expensive
  agg(collect_list("person") as "persons")

scala> solution.printSchema
root
 |-- _id: string (nullable = true)
 |-- persons: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- resid: string (nullable = true)

The nice thing about this solution is that it uses only the DataFrame API, with almost nothing specific to Scala or Java, so you could use it right away regardless of your language of choice.
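
To illustrate that point, here is a sketch of the same query translated to the Java API, assuming people is a Dataset<Row> with the schema from the question (where the array column is named person rather than persons):

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Same pipeline as the Scala query above, written against the question's column names.
Dataset<Row> solution = people
    .select(col("_id"), explode(col("person")).as("exploded"))  // one row per array element
    .select("_id", "exploded.*")                                 // lift the struct's fields
    .select(col("_id"), col("name"), col("adr1.resid"))          // drop the adr1 wrapper
    .select(col("_id"), struct("name", "resid").as("person"))    // rebuild the flat struct
    .groupBy("_id")                                              // regroup by the original id
    .agg(collect_list("person").as("person"));

solution.printSchema();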

