
I'm working on some Spark code and I keep getting this error:

TypeError: 'float' object is not iterable

on the line with the reduceByKey() call. Can someone help me? This is the stack trace of the error:

d[k] = comb(d[k], v) if k in d else creator(v)
  File "/home/hw/SC/SC_spark.py", line 535, in <lambda>
TypeError: 'float' object is not iterable

Here is the code:

def field_valid(m):
    dis=m[1]
    TxP=m[2]
    ef=m[3]
    pl=m[4]
    if TxP != 'NaN' and dis != 'NaN' and ef != 'NaN' and pl != 'NaN':
        return True
    else:
        return False

def parse_input(d):
    #d=data.split(',')

    s_name='S'+d[6] # serving cell name

    if d[2] =='NaN' or d[2] == '':
        ef='NaN'
    else:
        ef=float(d[2].strip().rstrip())

    if d[7] =='NaN' or d[7] == '' or d[7] == '0':
        TxP='NaN'
    else:
        TxP=float(d[7].strip().rstrip())

    if d[9] =='NaN' or d[9] == '':
        dis='NaN'
    else:
        dis=float(d[9].strip().rstrip())

    if d[10] =='NaN' or d[10] == '':
        pl='NaN'
    else:
        pl=float(d[10].strip().rstrip())

    return s_name, dis, TxP, ef, pl


from pyspark import SparkContext

sc=SparkContext(appName="SC_spark")
lines=sc.textFile(ip_file)
lines=lines.map(lambda m: (m.split(",")))
lines=lines.filter(lambda m: (m[6] != 'cell_name'))
my_rdd=lines.map(parse_input).filter(lambda m: (field_valid(m)==True))
my_rdd=my_rdd.map(lambda x: (x[0],(x[1],x[2])))
my_rdd=my_rdd.reduceByKey(lambda x,y:(max(x[0],y[0]),sum(x[1],y[1])))  # this is the line that raises the error

Here is some sample data:


Class,PB,EF,RP,RQ,ID,cell_name,TxP,BW,DIS,PL,geom
NaN,10,5110,-78.0,-7.0,134381669,S417|134381669|5110,62.78151250383644,10,2578.5795095469166,113.0,NaN
NaN,10,5110,-71.0,-6.599999904632568,134381669,S417|134381669|5110,62.78151250383644,10,2689.630258510342,106.0,NaN
NaN,10,5110,-77.0,-7.300000190734863,134381669,S417|134381669|5110,62.78151250383644,10,2907.8184899249713,112.0,19.299999999999983
NaN,10,5110,-91.0,-11.0,134381669,S417|134381669|5110,62.78151250383644,10,2779.96762695867,126.0,5.799999999999997
NaN,10,5110,-90.0,-12.69999980926514,134381669,S417|134381669|5110,62.78151250383644,10,2749.8351648579583,125.0,9.599999999999994
NaN,10,5110,-95.0,-13.80000019073486,134381669,S417|134381669|5110,62.78151250383644,10,2942.7938902934643,130.0,-2.4000000000000057
NaN,10,5110,-70.0,-7.099999904632568,134381669,S417|134381669|5110,62.78151250383644,10,3151.930706017461,105.0,22.69999999999999
  • I am not familiar with pyspark, but in the line where the error occurs you call sum with two arguments. Unless the first one is an iterable and the second an int, your error is probably there. Try calling sum(1.0, 2) in a Python console; it gives me a very similar error. Commented Apr 22, 2018 at 6:07
  • Hi @bla, I just tested it out and made sure all fields are converted to float. As you can see, I filter out the lines with NaN in those fields, so the values are floats only. I also checked the syntax of the lambda function; I separate it into (k, v). I didn't find anything wrong. Did you? Commented Apr 22, 2018 at 6:14
  • What exactly is m.split(",") doing? You have no commas in the data Commented Apr 22, 2018 at 6:15
  • @HelenZ you cannot pass a float as the first argument of sum. It expects an iterable. Check it out: docs.python.org/3.5/library/functions.html#sum. I cannot confirm that this is the case, since I am not sure x[1] is a float. But the stack traces are very similar. Commented Apr 22, 2018 at 6:19
  • Hi @cricket_007. BTW I just changed it to x[1]+y[1], and it works!! I'm new to Spark and can't distinguish Spark 1 and Spark 2 yet. Can you tell me how to do this in Spark 2? The expected result is the sum and max of the value 'dis' by the same key, and the key is the column 'cell_name'. Commented Apr 22, 2018 at 6:32
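
For the Spark 2 part of that last comment, a DataFrame-based sketch of the same aggregation might look like the following (a sketch only: "input.csv" stands in for ip_file, the column names follow the sample header, and I read the stated goal as max and sum of DIS per cell_name):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("SC_spark_df").getOrCreate()

# read the CSV using its first row as column names; "input.csv" is a placeholder for ip_file
df = spark.read.csv("input.csv", header=True, inferSchema=True)

# drop rows with null/NaN in the fields of interest, then aggregate per cell_name
result = (df.dropna(subset=["DIS", "TxP", "EF", "PL"])
            .groupBy("cell_name")
            .agg(F.max("DIS").alias("max_dis"),
                 F.sum("DIS").alias("sum_dis")))
result.show(truncate=False)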

1 Answer


the expected result is sum and max of value

In that case, you are looking for x[1] + y[1], not the built-in sum() function: sum() expects an iterable as its first argument, so passing it a float raises the TypeError you are seeing.

my_rdd.reduceByKey( lambda x,y: ( max(x[0],y[0]), x[1] + y[1] ) )
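
If it helps to see it in isolation, here is a minimal sketch with toy (cell_name, (dis, TxP)) pairs standing in for the parsed RDD; the combiner receives two plain value tuples per key, so ordinary arithmetic is all that is needed:

# reuses the SparkContext `sc` from the question; the pairs below are made-up toy data
pairs = sc.parallelize([
    ("S417", (2578.5, 62.78)),
    ("S417", (2689.6, 62.78)),
    ("S999", (1000.0, 40.0)),
])

# max of the first element, arithmetic sum of the second, per key
reduced = pairs.reduceByKey(lambda x, y: (max(x[0], y[0]), x[1] + y[1]))
print(reduced.collect())
# e.g. [('S417', (2689.6, 125.56)), ('S999', (1000.0, 40.0))] (ordering may vary)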

2 Comments

Hi @cricket_007, can I ask another question? Now I want to save the result into a .txt file, but I want to add a header to it. How should I do that? I used this statement: my_rdd.repartition(1).saveAsTextFile("sc_result/result.txt")
You need to union your RDD with a header RDD. stackoverflow.com/questions/26157456/…
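
Following that suggestion, a sketch of the header-union approach might look like this (the header text and output path are placeholders, and note that saveAsTextFile writes a directory of part files rather than a single .txt file):

# one-partition RDD holding just the header line
header = sc.parallelize(["cell_name,max_dis,sum_dis"], 1)

# format each (key, (max, sum)) record as a CSV line
data_lines = my_rdd.map(lambda kv: "{},{},{}".format(kv[0], kv[1][0], kv[1][1]))

# union puts the header partition first, so it stays on top after coalesce(1)
header.union(data_lines).coalesce(1).saveAsTextFile("sc_result/result_with_header")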
