This is the JSON file: https://drive.google.com/file/d/1Jb3OdoffyA71vYfojxLedZNPDLq9bn7b/view?usp=sharing

I am new to Scala and am learning how to use it to parse JSON files and ingest them into Spark as a table. I know how to do this in Python, but I am having trouble doing it in Scala.

The table/DataFrame should look like this after parsing the JSON file below:

  id          pub_date      doc_id       unique_id     c_id    p_id    type      source
lni001        20220301      7098727     64WP-UI-POLI    002     P02    org      internet
lni001        20220301      7098727     64WP-UI-POLI    002     P02    org      internet
lni001        20220301      7098727     64WP-UI-POLI    002     P02    org      internet
lni002        20220301      7097889     64WP-UI-CFGT    012     K21   location  internet
lni002        20220301      7097889     64WP-UI-CFGT    012     K21   location  internet

It would be great if I could get some help or ideas on how to do this. Thanks!

Here is the code that I used:

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

import spark.implicits._

val df = spark.read.option("multiline", true).json("json_path")
df.show()

But the code cannot parse the nested part (the content field). Here is a peek at the data:

{
   "id":"lni001",
   "pub_date":"20220301",
   "doc_id":"7098727",
   "unique_id":"64WP-UI-POLI",
   "content":[
      {
         "c_id":"002",
         "p_id":"P02",
         "type":"org",
         "source":"internet"  
      },
      {
         "c_id":"002",
         "p_id":"P02",
         "type":"org",
         "source":"internet" 
      },
      {
         "c_id":"002",
         "p_id":"P02",
         "type":"org",
         "source":"internet" 
      }
   ]
}

1 Answer

You should specify the schema explicitly, since Spark may be unable to infer it on its own. You can try it this way:

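import org.apache.spark.sql.types._

// Explicit schema: the top-level fields are strings, and content is
// an array of string-to-string maps (one map per nested object).
val schema = StructType(Array(
  StructField("id", StringType),
  StructField("pub_date", StringType),
  StructField("doc_id", StringType),
  StructField("unique_id", StringType),
  StructField("content", ArrayType(MapType(StringType, StringType)))))

spark.read
  .option("multiline", true)
  .schema(schema)
  .json("path")
  .show(false)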
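
Reading the file still leaves content as a single array column, so getting the flat table from the question needs one more step. Here is a minimal sketch of that step using explode, which emits one row per array element. The helper name flattenContent is hypothetical, and the sketch assumes content was read as an array of structs, which is what Spark's JSON schema inference is expected to produce for the sample record.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical helper: turn each element of the `content` array into its own row.
// Assumes `content` is array<struct<c_id, p_id, type, source>>.
def flattenContent(df: DataFrame): DataFrame =
  df
    // explode() repeats the parent row once per array element;
    // the element itself lands in the temporary column `c`.
    .withColumn("c", explode(col("content")))
    // Keep the top-level fields and pull the nested fields up by name.
    .select(
      col("id"), col("pub_date"), col("doc_id"), col("unique_id"),
      col("c.c_id"), col("c.p_id"), col("c.type"), col("c.source")
    )
```

With the df from the question, flattenContent(df).show(false) should then print one row per content element. Note that explode drops records whose content array is empty or null; explode_outer keeps them with null nested fields if that matters for your data.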

1 Comment

Got it. Thank you so much. It's my first time using Scala haha
