How to read a JSON file into a Spark DataFrame while skipping records that have a null value in some column?

Question

My data is like this:

{"id":"1","time":123,"sth":100} 
{"id":"2","sth":456} 
{"id":"3","time":789,"sth":300}

And I write my schema as:

StructType(
  Array(
    StructField("id", StringType, false),
    StructField("time", StringType, false),
    StructField("sth", StringType, true)
  )
)

And I read my data using:

val df = spark.read.schema(buildSchema()).json(path)

What I want is for the DataFrame to skip the lines that have no "time" value, so the result I want is:

| id | time | sth |
| 1 | 123 | 100 |
| 3 | 789 | 300 |

However, even though I set the nullable attribute to false in my StructField, Spark still reads the second line {"id":"2","sth":456} into my table, and I have to spend time dropping the rows with null values after reading. Is there an efficient way to do what I want?

Possible duplicate of DataFrameReadercsv(path: String) option for skipping blank lines
– mtoto, Commented May 2, 2017 at 6:57

learner · Accepted Answer · 2017-05-02 09:03:10Z

3

You can drop the rows that contain nulls right after reading, using na.drop():

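    // Sample data: one JSON array holding the three records from the question.
    val otherPeopleRDD = spark.sparkContext.makeRDD(
      """[{"id":"1","time":123,"sth":100},
         {"id":"2","sth":456},
         {"id":"3","time":789,"sth":300}]""" :: Nil)

    // na.drop() removes every row that has a null in any column.
    val otherPeople = spark.read.json(otherPeopleRDD).na.drop()
    otherPeople.show()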


+---+---+----+
| id|sth|time|
+---+---+----+
|  1|100| 123|
|  3|300| 789|
+---+---+----+
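
Note that na.drop() with no arguments removes a row if any column is null, so a record that has "time" but lacks "sth" would be dropped as well. If only "time" matters, the check can be limited to that column. The following is a minimal sketch rather than a definitive implementation: the object name DropMissingTime, the helper readWithTime, and the local master setting are made up for illustration, and json(Dataset[String]) assumes Spark 2.2 or later (with a file, spark.read.schema(schema).json(path) from the question would take its place). The schema fields are declared nullable here because, as the question observed, the nullable = false flag did not filter rows on read.

```scala
import org.apache.spark.sql.{DataFrame, Encoders, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object DropMissingTime {
  // Same three columns as the schema in the question, declared nullable.
  val schema: StructType = StructType(Array(
    StructField("id", StringType, nullable = true),
    StructField("time", StringType, nullable = true),
    StructField("sth", StringType, nullable = true)
  ))

  // Hypothetical helper: parse JSON lines with the schema above and keep
  // only the rows whose "time" column is not null.
  def readWithTime(spark: SparkSession, lines: Seq[String]): DataFrame = {
    val ds = spark.createDataset(lines)(Encoders.STRING)
    spark.read.schema(schema).json(ds).na.drop(Seq("time"))
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("drop-missing-time")
      .getOrCreate()

    val lines = Seq(
      """{"id":"1","time":123,"sth":100}""",
      """{"id":"2","sth":456}""",
      """{"id":"3","time":789,"sth":300}"""
    )

    // The record with id "2" has no "time" field, so it is filtered out.
    readWithTime(spark, lines).show()

    spark.stop()
  }
}
```

An equivalent filter is .filter(col("time").isNotNull), with col imported from org.apache.spark.sql.functions. Either way the drop is a lazy transformation, so it is applied while the file is being scanned rather than in a separate pass over the data.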

edited May 2, 2017 at 9:03

answered May 2, 2017 at 7:37

