
With this DataFrame, I get "Column is not iterable" when I try to groupBy and take the max:

linesWithSparkDF
+---+-----+
| id|cycle|
+---+-----+
| 31|   26|
| 31|   28|
| 31|   29|
| 31|   97|
| 31|   98|
| 31|  100|
| 31|  101|
| 31|  111|
| 31|  112|
| 31|  113|
+---+-----+
only showing top 10 rows


<ipython-input-41-373452512490> in runlgmodel2(model, data)
     65     linesWithSparkDF.show(10)
     66 
---> 67     linesWithSparkGDF = linesWithSparkDF.groupBy(col("id")).agg(max(col("cycle")))
     68     print "linesWithSparkGDF"
     69 

/usr/hdp/current/spark-client/python/pyspark/sql/column.py in __iter__(self)
    241 
    242     def __iter__(self):
--> 243         raise TypeError("Column is not iterable")
    244 
    245     # string methods

TypeError: Column is not iterable

4 Answers


It's because you've overwritten the max definition provided by Apache Spark; it was easy to spot, because max was expecting an iterable.
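To see the collision concretely, here is a minimal reproduction; the SparkSession setup and the two sample rows are just illustrative stand-ins for the OP's data:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(31, 26), (31, 28)], ["id", "cycle"])

# Without importing pyspark.sql.functions.max, the bare name "max" is
# Python's builtin, which tries to iterate its argument and so trips
# Column.__iter__:
try:
    df.groupBy(col("id")).agg(max(col("cycle")))
except TypeError as e:
    print(e)  # Column is not iterable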

To fix this, you can use a different syntax, and it should work:

# Note: the dict form names the resulting column "max(cycle)"
linesWithSparkGDF = linesWithSparkDF.groupBy(col("id")).agg({"cycle": "max"})

Or, alternatively:

from pyspark.sql.functions import col, max as sparkMax

linesWithSparkGDF = linesWithSparkDF.groupBy(col("id")).agg(sparkMax(col("cycle")))

1 Comment

Gee : ... i <3 scala!

The idiomatic way to avoid this problem, which stems from unfortunate namespace collisions between some Spark SQL function names and Python built-in function names, is to import the Spark SQL functions module like this:

from pyspark.sql import functions as F 
# USAGE: F.col(), F.max(), F.someFunc(), ...

Then, using the OP's example, you'd simply apply F like this:

linesWithSparkGDF = linesWithSparkDF.groupBy(F.col("id")) \
                                    .agg(F.max(F.col("cycle")))

In practice, this is how the problem is avoided idiomatically. =:)
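For completeness, here is that pattern end to end as a self-contained sketch; the SparkSession setup, the sample rows, and the max_cycle alias are illustrative, not from the OP:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(31, 26), (31, 28), (31, 100)], ["id", "cycle"])

# F.max unambiguously resolves to the Spark SQL aggregate, never the builtin
result = df.groupBy(F.col("id")).agg(F.max(F.col("cycle")).alias("max_cycle"))
result.show()
# +---+---------+
# | id|max_cycle|
# +---+---------+
# | 31|      100|
# +---+---------+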

3 Comments

This approach is in fact straightforward and works like a charm.
What if I created a UDF in Python? Is there some way to apply it to all columns? Think about it... you created a UDF to apply a default format, without special characters and in uppercase.
Hi @FernandoDelago This question might help you generically: stackoverflow.com/questions/34037889/…
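As one possible sketch of what that comment asks (the normalize UDF is hypothetical, and a DataFrame named df is assumed): define the UDF once, then loop over df.dtypes to apply it to every string column:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Hypothetical UDF: uppercase and drop special characters
normalize = F.udf(
    lambda s: None if s is None
    else "".join(ch for ch in s.upper() if ch.isalnum() or ch == " "),
    StringType(),
)

# Apply it to every string column of an assumed DataFrame df
for name, dtype in df.dtypes:
    if dtype == "string":
        df = df.withColumn(name, normalize(F.col(name)))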

I know the question is old, but this might help someone.

First, import the following:

from pyspark.sql import functions as F
from pyspark.sql.functions import col

Then:

linesWithSparkGDF = linesWithSparkDF.groupBy(col("id")).agg(F.max(col("cycle")))

1 Comment

Should be F.col I guess?

I faced a similar issue. Although the error looks misleading, it can be resolved by checking whether you missed the following import:

from pyspark.sql.functions import *

This brings in the required functions for aggregating the data, provided the column datatypes are right. I fixed a similar issue by adding the required import, so don't forget to check for it.

1 Comment

Yes, forgetting the import can cause this. Because min and max are also builtins, without the import you're not using the PySpark max but the builtin max. I wouldn't import * though; rather, from pyspark.sql import functions as F, and prefix your max like so: F.max. Or from pyspark.sql.functions import max as f_max to avoid confusion.
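Spelled out against the OP's DataFrame, the aliased-import variant from that comment would look like this:

from pyspark.sql.functions import col, max as f_max

linesWithSparkGDF = linesWithSparkDF.groupBy(col("id")).agg(f_max(col("cycle")))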
