
I am running over several CSV files and doing some checks, and for one file I am getting a NullPointerException. I suspect that there are some empty rows.

So I am running the following check, but for some reason it returns no rows at all:

import pyspark.sql.functions as sf
from pyspark.sql.types import BooleanType

# True only when every column in the row is None, i.e. the row is fully empty
check_empty = lambda row: not any(k is not None for k in row)
check_empty_udf = sf.udf(check_empty, BooleanType())
df.filter(check_empty_udf(sf.struct([col for col in df.columns]))).show()

Am I missing something within the filter function, or is it simply not possible to extract empty rows from DataFrames this way?

2 Answers


You could use df.dropna() to drop the rows containing nulls and then compare the counts.

Something like

df_clean = df.dropna()
num_empty_rows = df.count() - df_clean.count()
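
If you also want to inspect the content of those rows rather than just count them, one possible sketch (assuming "empty" means every column is null, and noting that dropna() with no arguments drops rows with any null, while dropna(how='all') drops only fully empty ones):

from functools import reduce
import pyspark.sql.functions as sf

# Predicate that is true only when every column in the row is null
all_null = reduce(lambda a, b: a & b, [sf.col(c).isNull() for c in df.columns])

# Show the fully empty rows themselves
df.filter(all_null).show()

# Count only the fully empty rows (dropna(how='all') drops exactly these)
num_fully_empty = df.count() - df.dropna(how='all').count()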

2 Comments

Thanks Andrew, but I would like to check the content of those rows so I have a clearer idea of what's happening.
The weird thing is that I got zero. Also, the same piece of code works fine on the DataFrame produced by the dropna transformation, while it throws the exception on the one without dropna.

You could use a built-in option of the CSV reader to deal with such scenarios.

val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("mode", "DROPMALFORMED") // Drop empty/malformed rows
  .load("hdfs:///path/file.csv")
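
Since the question uses PySpark, the same read looks roughly like this (a sketch using the same path as above; adjust it to your actual file):

df = (spark.read
      .format("csv")
      .option("header", "true")
      .option("mode", "DROPMALFORMED")  # Drop empty/malformed rows at read time
      .load("hdfs:///path/file.csv"))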

Check this reference - https://docs.databricks.com/spark/latest/data-sources/read-csv.html#reading-files

