
I have a dataframe with some null values. Some columns are entirely null.

>>> df.show()
+---+---+---+----+
|  A|  B|  C|   D|
+---+---+---+----+
|1.0|4.0|7.0|null|
|2.0|5.0|7.0|null|
|3.0|6.0|5.0|null|
+---+---+---+----+

In my case, I want to return a list of the column names that are entirely filled with null values. My idea was to detect the constant columns (since the whole column contains the same null value).

This is how I did it:

from pyspark.sql.functions import min, max

nullColumns = [c for c, const in df.select([(min(c) == max(c)).alias(c) for c in df.columns]).first().asDict().items() if const]

but this does not consider null columns as constant; it works only with actual values. How should I do it then?


4 Answers


One way would be to do it implicitly: select each column, count its NULL values, and then compare this with the total number of rows. With your data, this would be:

spark.version
# u'2.2.0'

from pyspark.sql.functions import col

nullColumns = []
numRows = df.count()
for k in df.columns:
  nullRows = df.where(col(k).isNull()).count()
  if nullRows ==  numRows: # i.e. if ALL values are NULL
    nullColumns.append(k)

nullColumns
# ['D']

But there is a simpler way: it turns out that the function countDistinct, when applied to a column with all NULL values, returns zero (0):

from pyspark.sql.functions import countDistinct

df.agg(countDistinct(df.D).alias('distinct')).collect()
# [Row(distinct=0)]

So the for loop now can be:

nullColumns = []
for k in df.columns:
  if df.agg(countDistinct(df[k])).collect()[0][0] == 0:
    nullColumns.append(k)

nullColumns
# ['D']

UPDATE (after comments): It seems possible to avoid collect in the second solution; since df.agg returns a dataframe with only one row, replacing collect with take(1) will safely do the job:

nullColumns = []
for k in df.columns:
  if df.agg(countDistinct(df[k])).take(1)[0][0] == 0:
    nullColumns.append(k)

nullColumns
# ['D']

4 Comments

@MehdiBenHamida Careful! collect here is not applied to the whole of your data, but to an aggregation. Arguably, this is a legitimate use of collect.
I know that collect is applied to the aggregation, but it still costs a lot in performance :/
@MehdiBenHamida perhaps you have not realized that what you ask is not at all trivial: one way or another, you'll have to go through all your rows, checking for NULLs; now, if you are not happy even to use countDistinct... (as if min and max would take less time, if applicable)
This will take a lot of time to detect all the null columns; I think there is a better alternative. I tried the first suggested solution: it is better than the second one, but it still takes too much time.

Extend the condition to

from pyspark.sql.functions import min, max

((min(c).isNull() & max(c).isNull()) | (min(c) == max(c))).alias(c) 

or use eqNullSafe (PySpark 2.3):

(min(c).eqNullSafe(max(c))).alias(c) 
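Plugged back into the list comprehension from the question, this could look like the following minimal sketch (assuming the same df and the min/max import above); note that it flags every constant column, not only the fully-null ones:

nullColumns = [c for c, const in
               df.select([((min(c).isNull() & max(c).isNull()) | (min(c) == max(c))).alias(c)
                          for c in df.columns])
                 .first()
                 .asDict()
                 .items()
               if const]

nullColumns
# ['D']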

2 Comments

@desertnaut: this is much faster, it takes only a fraction of a second :D
This works for the case when all values in the column are null. But consider the case with column values of [null, 1, 1, null]: here min and max will both equal 1, so the column will be incorrectly identified as having all nulls.

How about this? In order to guarantee that a column contains only nulls, two properties must be satisfied:

(1) The min value is equal to the max value

(2) The min or max is null

Or, equivalently

(1) The min AND max are both equal to None

Note that if property (2) is not satisfied, the case where column values are [null, 1, null, 1] would be incorrectly reported, since the min and max will both be 1.

import pyspark.sql.functions as F

def get_null_column_names(df):
    column_names = []

    for col_name in df.columns:

        min_ = df.select(F.min(col_name)).first()[0]
        max_ = df.select(F.max(col_name)).first()[0]

        if min_ is None and max_ is None:
            column_names.append(col_name)

    return column_names

Here's an example in practice:

>>> rows = [(None, 18, None, None),
            (1, None, None, None),
            (1, 9, 4.0, None),
            (None, 0, 0., None)]

>>> schema = "a: int, b: int, c: float, d:int"

>>> df = spark.createDataFrame(data=rows, schema=schema)

>>> df.show()

+----+----+----+----+
|   a|   b|   c|   d|
+----+----+----+----+
|null|  18|null|null|
|   1|null|null|null|
|   1|   9| 4.0|null|
|null|   0| 0.0|null|
+----+----+----+----+

>>> get_null_column_names(df)
['d']

2 Comments

If anyone is wondering where F comes from: it is the functions module imported as F, i.e. from pyspark.sql import functions as F
Good catch @GunayAnach. I updated the answer to include this.

Time-effective (Spark > 3.1), if you have more non-null columns than null columns

Checking whether an entire column is null is generally a time-intensive operation; however, calling isNotNull() is generally faster than calling isNull() on non-null columns. The following trick with limit() can speed up the approach considerably if you expect more non-null columns than null ones.

import time

# col_a is a Null col
st = time.time()
print("Result:", sdf.where(sdf.col_a.isNotNull()).limit(1).count())
print("Time Taken:", time.time() - st)
# Output
# Result: 0
# Time Taken: 215.386

# col_b is a Not Null col
st = time.time()
print("Result:", sdf.where(sdf.col_b.isNotNull()).limit(1).count())
print("Time Taken:", time.time() - st)
# Output
# Result: 1
# Time Taken: 7.857

A result of 0 means it is a fully-null column, and 1 means otherwise. Now you can loop over all columns in the Spark dataframe, as sketched below.
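A minimal sketch of that loop, assuming the same sdf dataframe as in the timing example above (limit(1) lets Spark stop scanning a column as soon as one non-null value is found):

nullColumns = []
for k in sdf.columns:
    # a count of 0 after limit(1) means no non-null value was found, i.e. a fully-null column
    if sdf.where(sdf[k].isNotNull()).limit(1).count() == 0:
        nullColumns.append(k)

nullColumns
# e.g. ['col_a'] with the columns above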

