
With this DataFrame, I get "Column is not iterable" when I try to groupBy and take the max:

linesWithSparkDF
+---+-----+
| id|cycle|
+---+-----+
| 31|   26|
| 31|   28|
| 31|   29|
| 31|   97|
| 31|   98|
| 31|  100|
| 31|  101|
| 31|  111|
| 31|  112|
| 31|  113|
+---+-----+
only showing top 10 rows


<ipython-input-41-373452512490> in runlgmodel2(model, data)
     65     linesWithSparkDF.show(10)
     66 
---> 67     linesWithSparkGDF = linesWithSparkDF.groupBy(col("id")).agg(max(col("cycle")))
     68     print "linesWithSparkGDF"
     69 

/usr/hdp/current/spark-client/python/pyspark/sql/column.py in __iter__(self)
    241 
    242     def __iter__(self):
--> 243         raise TypeError("Column is not iterable")
    244 
    245     # string methods

TypeError: Column is not iterable

4 Answers


It's because you've overwritten the max definition provided by Apache Spark; it was easy to spot, because max was expecting an iterable.
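To see the collision concretely, here is a minimal reproduction; the SparkSession setup and the two sample rows are just illustrative stand-ins for the OP's data:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(31, 26), (31, 28)], ["id", "cycle"])

# Without importing pyspark.sql.functions.max, the bare name "max" is
# Python's builtin, which tries to iterate its argument and so trips
# Column.__iter__:
try:
    df.groupBy(col("id")).agg(max(col("cycle")))
except TypeError as e:
    print(e)  # Column is not iterable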

To fix this, you can use a different syntax, and it should work:

# Note: the dict form names the resulting column "max(cycle)"
linesWithSparkGDF = linesWithSparkDF.groupBy(col("id")).agg({"cycle": "max"})

Or, alternatively:

from pyspark.sql.functions import col, max as sparkMax

linesWithSparkGDF = linesWithSparkDF.groupBy(col("id")).agg(sparkMax(col("cycle")))

1 Comment

Gee : ... i <3 scala!

The idiomatic way to avoid this problem, which stems from unfortunate namespace collisions between some Spark SQL function names and Python built-in function names, is to import the Spark SQL functions module like this:

from pyspark.sql import functions as F 
# USAGE: F.col(), F.max(), F.someFunc(), ...

Then, using the OP's example, you'd simply apply F like this:

linesWithSparkGDF = linesWithSparkDF.groupBy(F.col("id")) \
                                    .agg(F.max(F.col("cycle")))

In practice, this is how the problem is avoided idiomatically. =:)
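For completeness, here is that pattern end to end as a self-contained sketch; the SparkSession setup, the sample rows, and the max_cycle alias are illustrative, not from the OP:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(31, 26), (31, 28), (31, 100)], ["id", "cycle"])

# F.max unambiguously resolves to the Spark SQL aggregate, never the builtin
result = df.groupBy(F.col("id")).agg(F.max(F.col("cycle")).alias("max_cycle"))
result.show()
# +---+---------+
# | id|max_cycle|
# +---+---------+
# | 31|      100|
# +---+---------+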

3 Comments

This approach is in fact straightforward and works like a charm.
What if I created a UDF in Python? Is there some way to apply it to all columns? Think about it... you created a UDF to apply a default format, without special characters and in uppercase.
Hi @FernandoDelago This question might help you generically: stackoverflow.com/questions/34037889/…
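As one possible sketch of what that comment asks (the normalize UDF is hypothetical, and a DataFrame named df is assumed): define the UDF once, then loop over df.dtypes to apply it to every string column:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Hypothetical UDF: uppercase and drop special characters
normalize = F.udf(
    lambda s: None if s is None
    else "".join(ch for ch in s.upper() if ch.isalnum() or ch == " "),
    StringType(),
)

# Apply it to every string column of an assumed DataFrame df
for name, dtype in df.dtypes:
    if dtype == "string":
        df = df.withColumn(name, normalize(F.col(name)))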

I know the question is old, but this might help someone.

First, import the following:

from pyspark.sql import functions as F
from pyspark.sql.functions import col

Then:

linesWithSparkGDF = linesWithSparkDF.groupBy(col("id")).agg(F.max(col("cycle")))

1 Comment

Should be F.col I guess?

I faced a similar issue. Although the error looks misleading, it can be resolved by checking whether you missed the following import:

from pyspark.sql.functions import *

This brings in the required functions for aggregating the data, provided the column datatypes are right. I fixed a similar issue by adding the required import, so don't forget to check for it.

1 Comment

Yes, forgetting the import can cause this. Because min and max are also builtins, without the import you're not using the PySpark max but the builtin max. I wouldn't import * though; rather, from pyspark.sql import functions as F, and prefix your max like so: F.max. Or from pyspark.sql.functions import max as f_max to avoid confusion.
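Spelled out against the OP's DataFrame, the aliased-import variant from that comment would look like this:

from pyspark.sql.functions import col, max as f_max

linesWithSparkGDF = linesWithSparkDF.groupBy(col("id")).agg(f_max(col("cycle")))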
