
Spark Version: 2.1

Scala Version: 2.11

I have a DataFrame with the following structure before writing it to a Parquet file. It has a lot of other columns, but I cut it down to only 2 columns for clarity:

+---+--------------------+
|day|   table_row        |
+---+--------------------+
|  8|[,129,,,,,J,WENDI...|
|  8|[_DELETE_THIS_,_D...|
|  8|[_DELETE_THIS_,_D...|

...and the schema looks like this:

     root 
     |-- day: long (nullable = true)
     |-- table_row: struct (nullable = true)
     |    |-- DATE: string (nullable = true)
     |    |-- ADMISSION_NUM: string (nullable = true)
     |    |-- SOURCE_CODE: string (nullable = true)
etc..

'table_row' has over 100 data elements and I only posted a snippet. During processing I had to create a couple of dummy rows with every field populated with "_DELETE_THIS_". For every normal row there are 2 dummy rows. Now I am trying to filter these dummy rows out of the DataFrame and write only the valid rows, but I have not been able to do that. I tried a couple of approaches but couldn't find a proper solution. Can someone help me with this?

Thanks Qubiter

1 Answer

You can use the filter function. You can take any field from table_row, since you said every field in the dummy rows is populated with _DELETE_THIS_:

val finalDF = df.filter($"table_row.DATE" =!= "_DELETE_THIS_")

Here $"table_row.DATE" is how you call DATE element of the struct column.

I hope the answer is helpful.
