
I'm using ydata-profiling to generate profiling reports from a large PySpark DataFrame without converting it to Pandas (to avoid memory issues on large datasets). Some columns contain the string "UNKNOWN", which I replace with None:

df = df.na.replace("UNKNOWN", None)

This works fine in PySpark: when I check with df.selectExpr("count(*)", "count_if(col_name IS NULL)").show() or df.filter(col("col_name").isNull()).count(), I see the correct number of missing values.

The problem: when I run ydata-profiling directly on the PySpark DataFrame:

from ydata_profiling import ProfileReport
report = ProfileReport(df, minimal=True)
report.to_file("report.html")

... the report only shows missing values for the categorical columns. In the numerical columns I see mean = NaN but missing = 0, which is contradictory: if the mean is NaN, there should be missing values reported as well.

How can I ensure that ydata-profiling correctly detects missing values in PySpark DataFrames – especially in numerical columns – without having to call .toPandas()?

  • Based on their documentation, missing-value analysis isn't supported yet for Spark DataFrames. – Commented Apr 12 at 23:47
