
Spark Version: 2.1

Scala Version: 2.11

I have a DataFrame with the following structure before writing it to a Parquet file. It has a lot of other columns, but I cut it down to only 2 columns for clarity:

+---+--------------------+
|day|   table_row        |
+---+--------------------+
|  8|[,129,,,,,J,WENDI...|
|  8|[_DELETE_THIS_,_D...|
|  8|[_DELETE_THIS_,_D...|

...and the schema looks like this:

     root 
     |-- day: long (nullable = true)
     |-- table_row: struct (nullable = true)
     |    |-- DATE: string (nullable = true)
     |    |-- ADMISSION_NUM: string (nullable = true)
     |    |-- SOURCE_CODE: string (nullable = true)
etc..

'table_row' has over 100 data elements and I only posted a snippet. During processing I had to create a couple of dummy rows with every field populated with "_DELETE_THIS_". For every normal row there are 2 dummy rows. Now I am trying to filter these dummy rows out of the DataFrame and write only the valid rows, but I have not been able to do that. I tried a couple of approaches but couldn't find a proper solution. Can someone help me with this?

Thanks Qubiter

1 Answer

You can use the filter function. You can take any field from table_row, since you said every field in the dummy rows is populated with _DELETE_THIS_:

val finalDF = df.filter($"table_row.DATE" =!= "_DELETE_THIS_")

Here $"table_row.DATE" is how you call DATE element of the struct column.

I hope the answer is helpful.
