I'm using ydata-profiling to generate profiling reports from a large PySpark DataFrame without converting it to Pandas (to avoid memory issues on large datasets). Some columns contain the string "UNKNOWN", which I replace with None:
df = df.na.replace("UNKNOWN", None)
This works as expected in PySpark: both df.selectExpr("count(*)", "count_if(col_name IS NULL)").show() and df.filter(col("col_name").isNull()).count() report the correct number of missing values. The problem: when I run ydata-profiling directly on the PySpark DataFrame:
from ydata_profiling import ProfileReport
report = ProfileReport(df, minimal=True)
report.to_file("report.html")
... the report only shows missing values for the categorical columns. For numerical columns I see mean = NaN but missing = 0, which is contradictory: a NaN mean implies there are missing values that should have been counted.
How can I ensure that ydata-profiling correctly detects missing values in PySpark DataFrames, especially in numerical columns, without having to call .toPandas()?