
I am posting this question after searching the web extensively without finding an answer. I have a JSONArray in the following format:

[ 
  {
    "firstName":"John",
    "lastName":"Doe",
    "deparment" : {
       "DeptCode":"10",
       "deptName" : "HR"
     }
  },
  {
    "firstName":"Mel",
    "lastName":"Gibson",
    "deparment" : {
       "DeptCode":"20",
       "deptName" : "IT"
     }
  }
]

The JSONArray is from the org.json.simple package. I am trying to convert it into a Java Spark DataFrame with the code below:

SparkConf conf = new SparkConf().setAppName("linecount").setMaster("local[*]");
SparkSession session = SparkSession.builder().config(conf).getOrCreate();       
Dataset<Row> dataset = session.read().json(array.toString());

But no luck; I am facing the error below. Also, I can see that in Scala we can convert it to a DataFrame using the DS method. Has someone tried this before?

Exception in thread "main" java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: [{"firstName":%22John%22,%22lastName%22:%22Doe%22%7D,%7B%22firstName%22:%22Mel%22,%22lastName%22:%22Gibson%22%7D%5D
at org.apache.hadoop.fs.Path.initialize(Path.java:206)
at org.apache.hadoop.fs.Path.<init>(Path.java:172)
at org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:615)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:350)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:350)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:349)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:333)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:279)
at com.vikas.rawat.AnotherMainClass.main(AnotherMainClass.java:34)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: [{"firstName":%22John%22,%22lastName%22:%22Doe%22%7D,%7B%22firstName%22:%22Mel%22,%22lastName%22:%22Gibson%22%7D%5D
at java.net.URI.checkPath(Unknown Source)
at java.net.URI.<init>(Unknown Source)
at org.apache.hadoop.fs.Path.initialize(Path.java:203)
... 14 more
5 Comments
  • What do you mean by "no luck"? Compilation errors? Runtime errors? Error messages? Stack traces? Please be specific. (For what it is worth, programming is not about luck ...) Commented Jan 13, 2022 at 9:00
  • @StephenC It seems that this is not the right way to do it. Also it's throwing the below error: Caused by: java.net.URISyntaxException: Relative path in absolute URI: [{"firstName":%22John%22,%22lastName%22:%22Doe%22%7D Commented Jan 13, 2022 at 9:12
  • Please EDIT your question to include the complete stacktrace as text. Commented Jan 13, 2022 at 9:34
  • I have included the stacktrace as well. Commented Jan 13, 2022 at 9:39
  • I don't know the answer ... but I know why json(array.toString()) doesn't work. The json method expects its argument to be a string that is a path to a file in the file system (see the sketch after these comments). Commented Jan 13, 2022 at 10:26
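
For illustration, a minimal sketch of the usage that overload actually expects, reusing array and session from the question. The path /tmp/employees.json is hypothetical, and this assumes Spark's JSON reader treats a root-level array on a single line as one row per element:

import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// json(String) interprets its argument as a file path, not as JSON content,
// so the array has to be written to disk before it can be read back.
Files.write(Paths.get("/tmp/employees.json"), array.toJSONString().getBytes());
Dataset<Row> dataset = session.read().json("/tmp/employees.json");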

2 Answers


You should create an RDD from the JSON string and pass it to the spark.read().json method.

SparkSession spark = SparkSession.builder().master("local").getOrCreate();

String s = "{\"root\":[ \n" +
                "  {\n" +
                "    \"firstName\":\"John\",\n" +
                "    \"lastName\":\"Doe\",\n" +
                "    \"deparment\" : {\n" +
                "       \"DeptCode\":\"10\",\n" +
                "       \"deptName\" : \"HR\"\n" +
                "     }\n" +
                "  },\n" +
                "  {\n" +
                "    \"firstName\":\"Mel\",\n" +
                "    \"lastName\":\"Gibson\",\n" +
                "    \"deparment\" : {\n" +
                "       \"DeptCode\":\"20\",\n" +
                "       \"deptName\" : \"IT\"\n" +
                " }\n" +
                "}\n" +
                "]}";
JSONObject json = (JSONObject) JSONValue.parse(s);
JSONArray msgsArray = (JSONArray) json.get("root");

scala.collection.Seq<String> seq = scala.collection.JavaConverters.asScalaIteratorConverter
                       (Arrays.asList(msgsArray.toString()).iterator()).asScala().toSeq();

RDD<String> jsonRDD = spark.sparkContext().
parallelize(seq, 4, scala.reflect.ClassTag$.MODULE$.apply(String.class));

spark.read().json(jsonRDD).show();


+---------+---------+--------+
|deparment|firstName|lastName|
+---------+---------+--------+
| {10, HR}|     John|     Doe|
| {20, IT}|      Mel|  Gibson|
+---------+---------+--------+

2 Comments

What if I flatten the department as well, so that DeptCode and deptName become the column names and the department column is omitted?
If it's flattened in the JSONArray itself, then you will get the 2 columns you mentioned. Otherwise you need to use the explode function.
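
For the flattening asked about in the comments, one option is a star expression over the struct column rather than explode, which applies to array columns, not structs. A sketch, where df stands for the DataFrame produced by the code above:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Promote the nested struct's fields to top-level columns; "deparment.*"
// expands to DeptCode and deptName, and the struct column itself is dropped.
Dataset<Row> df = spark.read().json(jsonRDD);
df.selectExpr("firstName", "lastName", "deparment.*").show();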

You can read JSON from a string into a Dataset, but there is a caveat: it has to be one JSON object per string.

From the Spark documentation:

// Alternatively, a DataFrame can be created for a JSON dataset represented by
// a Dataset[String] storing one JSON object per string
val otherPeopleDataset = spark.createDataset(
  """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val otherPeople = spark.read.json(otherPeopleDataset)
otherPeople.show()
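
A rough Java translation of that snippet, assuming Spark 2.2+ where the json(Dataset<String>) overload and Encoders.STRING() are available:

import java.util.Collections;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().master("local").getOrCreate();

// Wrap one JSON object per string in a Dataset<String> and let Spark parse it.
Dataset<String> jsonDs = spark.createDataset(
        Collections.singletonList(
                "{\"name\":\"Yin\",\"address\":{\"city\":\"Columbus\",\"state\":\"Ohio\"}}"),
        Encoders.STRING());
Dataset<Row> otherPeople = spark.read().json(jsonDs);
otherPeople.show();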
