I have a DataFrame called good_df with columns of mixed types. I'm trying to replace empty values with 'null' in the StringType columns. I expected the code below to work, but it fails.

self.good_df = self.good_df.select([when((col(c)=='') & (isinstance(self.good_df.schema[c].dataType, StringType)),'null').otherwise(col(c)).alias(c) for c in self.good_df.columns])

I'm looking at the error message and it's not giving me much in the way of clues:

Traceback (most recent call last):
  File "", line 1, in
  File "/usr/lib/python2.7/site-packages/pyspark/sql/column.py", line 116, in _
    njc = getattr(self._jc, name)(jc)
  File "/usr/lib/python2.7/site-packages/py4j/java_gateway.py", line 1257, in call
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/python2.7/site-packages/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/python2.7/site-packages/py4j/protocol.py", line 332, in get_return_value
    format(target_id, ".", name, value))
Py4JError: An error occurred while calling o792.and. Trace:
py4j.Py4JException: Method and([class java.lang.Boolean]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

Does anyone have any ideas on what is going on? Thank you!

1 Answer

The error message you got:

py4j.Py4JException: Method and([class java.lang.Boolean]) does not exist

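This means you are applying the & (AND) operator between a Column expression and a plain Python boolean. isinstance(...) is evaluated in Python and returns True or False, which Py4J passes to the JVM as a java.lang.Boolean, and Column has no and method that accepts one.

You need to change this part: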

(isinstance(self.good_df.schema[c].dataType, StringType))

to

from pyspark.sql.functions import lit

lit(isinstance(self.good_df.schema[c].dataType, StringType))

That said, you can move the column-type check into the Python list comprehension itself, so that only string columns are wrapped in when():

self.good_df = self.good_df.select(*[
    when((col(c) == ''), 'null').otherwise(col(c)).alias(c) if t == "string" else col(c)
    for c, t in self.good_df.dtypes
])

1 Comment

Thank you very, very much! The last snippet worked like a charm. You are awesome!
