
I am struggling with a PySpark assignment. I need to get the sum of all viewing numbers per channel. I have two sets of files: one showing each show and its views, the other showing the shows and which channel(s) they are shown on.

I have performed a join on the two files, and the result looks like:

[(u'Surreal_News', (u'BAT', u'11')),
 (u'Hourly_Sports', (u'CNO', u'79')),
 (u'Hourly_Sports', (u'CNO', u'3')),

I now need to extract the channel as the key and then, I think, use reduceByKey to sum the views per channel.

I have written this function to extract the channel as the key with the views alongside, which I could then pass to reduceByKey to sum the results. However, when I try to display the results of the function below with collect(), I get an "AttributeError: 'tuple' object has no attribute 'split'" error.

def extract_chan_views(show_chan_views):
    key_value = show_chan_views.split(",")
    chan_views = key_value[1].split(",")
    chan = chan_views[0]
    views = int(chan_views[1])
    return (chan,views) 

2 Answers


Since this is an assignment, I'll try to explain what's going on rather than just giving the answer. Hopefully that will be more helpful!

This actually isn't anything to do with PySpark; it's a plain Python issue. As the error says, you're trying to call split on a tuple, but split is a string operation. Instead, access the elements by index. The object you're passing in:

[(u'Surreal_News', (u'BAT', u'11')),
 (u'Hourly_Sports', (u'CNO', u'79')),
 (u'Hourly_Sports', (u'CNO', u'3')),

is a list of tuples, where the first element is a unicode string and the second is another tuple. You can take them apart like this (I'll annotate each step with comments):

for item in your_list:
    #item = (u'Surreal_News', (u'BAT', u'11')) on iteration one

    first_index, second_index = item #this will unpack the two indices
    #now:
    #first_index = u'Surreal_News'
    #second_index = (u'BAT', u'11')

    first_sub_index, second_sub_index = second_index #unpack again
    #now:
    #first_sub_index = u'BAT'
    #second_sub_index = u'11'

Note that you never had to split on commas anywhere. Also note that u'11' in your data is a string, not an integer. It can be converted with int(u'11'), as long as you're sure it's never malformed. If you prefer specifying indices to unpacking, you can do the same thing:

first_index, second_index = item

is equivalent to:

first_index = item[0]
second_index = item[1]
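Applied to the function in the question, the indexing approach could look like this (a sketch, assuming each joined record has exactly the (show, (channel, views)) shape shown above):

```python
# Sketch of the fixed function: the record is already a tuple, so we
# index into it instead of calling split (which only strings have).
def extract_chan_views(show_chan_views):
    chan_views = show_chan_views[1]   # inner tuple, e.g. (u'BAT', u'11')
    chan = chan_views[0]              # channel name
    views = int(chan_views[1])        # views arrive as a string
    return (chan, views)
```

For example, extract_chan_views(('Hourly_Sports', ('CNO', '79'))) returns ('CNO', 79), which is exactly the (key, value) pair reduceByKey wants.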

Also note that this gets more complicated if you are unsure what form the data will take - that is, if some records have two items in them and others three. In that case, unpacking and indexing in a generalized way inside a loop requires a bit more thought.
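If record length really can vary, Python 3's extended unpacking is one way to cope. A plain-Python sketch with made-up data:

```python
# Extended unpacking (Python 3) collects "the rest" into a list, so a
# record with extra fields no longer raises ValueError on unpacking.
short_item = ('Surreal_News', ('BAT', '11'))
long_item = ('Surreal_News', ('BAT', '11'), 'extra_field')

first, *rest = short_item
# first == 'Surreal_News', rest == [('BAT', '11')]

first, *rest = long_item
# first == 'Surreal_News', rest == [('BAT', '11'), 'extra_field']
```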


5 Comments

Thanks for the quick response, Jeff. When I use that code in my function I get the error "TypeError: 'type' object is not iterable". My Python knowledge is not great, so I will do a bit more research online and find some PySpark examples that loop through data. But what is the object that is not iterable? I thought my function takes only a single line of another RDD as an argument, so is that why it's saying it's not iterable?
Is the 'your_list' you mention in your example the argument that my function takes, so in my case 'show_chan_views'? When I try the code below I get the error "Too many values to unpack" on line 3:

def extract_chan_views(show_chan_views):
    for item in show_chan_views:
        first_index, second_index = item
        first_sub_index, second_sub_index = second_index
    return (first_sub_index, second_sub_index)
So the first error you're getting tells us that whatever you're trying to go through in the for loop isn't something that can be iterated through. Check what it is first. The second error means you're trying to unpack into, for example, two variables when your object has three.
I don't understand, the object should only have two items, as you have stated above. I can also verify this by running the following, which prints 2 for each item:

for item in show_chan_views:
    f_ind, s_ind = item
    print len(s_ind)
I found the solution. To access the channel and views I need to use chan = show_chan_views[1][0] etc., since the data is presented as nested tuples.
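Pulling the comment thread together, the whole sum-per-channel step can be sketched in plain Python (sample data made up to match the shape above; on the RDD the equivalent map plus reduceByKey does the grouping):

```python
# Plain-Python equivalent of map(...) then reduceByKey(...): index into
# each record as found in the comments, then accumulate views per channel.
records = [('Surreal_News', ('BAT', '11')),
           ('Hourly_Sports', ('CNO', '79')),
           ('Hourly_Sports', ('CNO', '3'))]

totals = {}
for show_chan_views in records:
    chan = show_chan_views[1][0]        # channel
    views = int(show_chan_views[1][1])  # views string -> int
    totals[chan] = totals.get(chan, 0) + views
# totals is now {'BAT': 11, 'CNO': 82}

# The RDD version would be roughly:
# joined.map(lambda r: (r[1][0], int(r[1][1]))).reduceByKey(lambda a, b: a + b)
```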

I am not exactly resolving your code, but I faced the same error when I applied a join transformation on two datasets.

Let's say A and B are two RDDs.

c = A.join(B)

c is still an RDD, but each of its elements is a tuple, not a string, so we cannot perform split(",")-style operations on an element. The parts have to be accessed by index instead.

If we want to access the tuple, let's say D is one such element:

E = D[1]  # instead of E = D.split(",")[1]
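A tiny sketch of that point, using a made-up joined element:

```python
# D stands for one element of the joined result: indexing works on a
# tuple, but string methods like split do not exist on it.
D = ('Hourly_Sports', ('CNO', '79'))
E = D[1]                         # -> ('CNO', '79')
assert not hasattr(D, 'split')   # tuples have no split method
```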

