
I am struggling with a PySpark assignment. I need to get the sum of all viewing numbers per channel. I have two sets of files: one showing each show and its views, the other showing the shows and which channel(s) they are shown on.

I have performed a join on the two files, and the result looks like:

[(u'Surreal_News', (u'BAT', u'11')),
 (u'Hourly_Sports', (u'CNO', u'79')),
 (u'Hourly_Sports', (u'CNO', u'3')),

I now need to extract the channel as the key and then, I think, use reduceByKey to sum the views per channel.

I have written this function to extract the channel as the key with the views alongside, which I could then pass to reduceByKey to sum the results. However, when I try to display the results of the function below with collect(), I get an "AttributeError: 'tuple' object has no attribute 'split'" error.

def extract_chan_views(show_chan_views):
    key_value = show_chan_views.split(",")
    chan_views = key_value[1].split(",")
    chan = chan_views[0]
    views = int(chan_views[1])
    return (chan,views) 

2 Answers


Since this is an assignment, I'll try to explain what's going on rather than just giving the answer. Hopefully that will be more helpful!

This actually isn't anything to do with PySpark; it's a plain Python issue. As the error says, you're trying to call split on a tuple, but split is a string operation. Instead, access the elements by index. The object you're passing in:

[(u'Surreal_News', (u'BAT', u'11')),
 (u'Hourly_Sports', (u'CNO', u'79')),
 (u'Hourly_Sports', (u'CNO', u'3')),

is a list of tuples, where the first element is a unicode string and the second is another tuple. You can take them apart like this (I'll annotate each step with comments):

for item in your_list:
    #item = (u'Surreal_News', (u'BAT', u'11')) on iteration one

    first_index, second_index = item #this will unpack the two indices
    #now:
    #first_index = u'Surreal_News'
    #second_index = (u'BAT', u'11')

    first_sub_index, second_sub_index = second_index #unpack again
    #now:
    #first_sub_index = u'BAT'
    #second_sub_index = u'11'

Note that you never had to split on commas anywhere. Also note that u'11' in your data is a string, not an integer. It can be converted with int(u'11'), as long as you're sure it's never malformed. If you prefer specifying indices to unpacking, you can do the same thing:

first_index, second_index = item

is equivalent to:

first_index = item[0]
second_index = item[1]
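Applied to the function in the question, the indexing approach could look like this (a sketch, assuming each joined record has exactly the (show, (channel, views)) shape shown above):

```python
# Sketch of the fixed function: the record is already a tuple, so we
# index into it instead of calling split (which only strings have).
def extract_chan_views(show_chan_views):
    chan_views = show_chan_views[1]   # inner tuple, e.g. (u'BAT', u'11')
    chan = chan_views[0]              # channel name
    views = int(chan_views[1])        # views arrive as a string
    return (chan, views)
```

For example, extract_chan_views(('Hourly_Sports', ('CNO', '79'))) returns ('CNO', 79), which is exactly the (key, value) pair reduceByKey wants.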

Also note that this gets more complicated if you are unsure what form the data will take - that is, if some records have two items in them and others three. In that case, unpacking and indexing in a generalized way inside a loop requires a bit more thought.
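If record length really can vary, Python 3's extended unpacking is one way to cope. A plain-Python sketch with made-up data:

```python
# Extended unpacking (Python 3) collects "the rest" into a list, so a
# record with extra fields no longer raises ValueError on unpacking.
short_item = ('Surreal_News', ('BAT', '11'))
long_item = ('Surreal_News', ('BAT', '11'), 'extra_field')

first, *rest = short_item
# first == 'Surreal_News', rest == [('BAT', '11')]

first, *rest = long_item
# first == 'Surreal_News', rest == [('BAT', '11'), 'extra_field']
```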


5 Comments

Thanks for the quick response, Jeff. When I use that code in my function I get the error "TypeError: 'type' object is not iterable". My Python knowledge is not great, so I will do a bit more research online and find some PySpark examples that loop through data. But what is the object that is not iterable? I thought my function takes only a single line of another RDD as an argument, so is that why it's saying it's not iterable?
Is the 'your_list' you mention in your example the argument that my function takes, so in my case 'show_chan_views'? When I try the code below I get the error "Too many values to unpack" on line 3:

def extract_chan_views(show_chan_views):
    for item in show_chan_views:
        first_index, second_index = item
        first_sub_index, second_sub_index = second_index
    return (first_sub_index, second_sub_index)
So the first error you're getting tells us that whatever you're trying to go through in the for loop isn't something that can be iterated through. Check what it is first. The second error means you're trying to unpack into, for example, two variables when your object has three.
I don't understand, the object should only have two items, as you have stated above. I can also verify this by running the following, which prints 2 for each item:

for item in show_chan_views:
    f_ind, s_ind = item
    print len(s_ind)
I found the solution. To access the channel and views I need to use chan = show_chan_views[1][0] etc., since the data is presented as nested tuples.
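Pulling the comment thread together, the whole sum-per-channel step can be sketched in plain Python (sample data made up to match the shape above; on the RDD the equivalent map plus reduceByKey does the grouping):

```python
# Plain-Python equivalent of map(...) then reduceByKey(...): index into
# each record as found in the comments, then accumulate views per channel.
records = [('Surreal_News', ('BAT', '11')),
           ('Hourly_Sports', ('CNO', '79')),
           ('Hourly_Sports', ('CNO', '3'))]

totals = {}
for show_chan_views in records:
    chan = show_chan_views[1][0]        # channel
    views = int(show_chan_views[1][1])  # views string -> int
    totals[chan] = totals.get(chan, 0) + views
# totals is now {'BAT': 11, 'CNO': 82}

# The RDD version would be roughly:
# joined.map(lambda r: (r[1][0], int(r[1][1]))).reduceByKey(lambda a, b: a + b)
```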

I am not exactly resolving your code, but I faced the same error when I applied a join transformation on two datasets.

Let's say A and B are two RDDs.

c = A.join(B)

c is still an RDD, but each of its elements is a tuple, not a string, so we cannot perform split(",")-style operations on an element. The parts have to be accessed by index instead.

If we want to access the tuple, let's say D is one such element:

E = D[1]  # instead of E = D.split(",")[1]
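A tiny sketch of that point, using a made-up joined element:

```python
# D stands for one element of the joined result: indexing works on a
# tuple, but string methods like split do not exist on it.
D = ('Hourly_Sports', ('CNO', '79'))
E = D[1]                         # -> ('CNO', '79')
assert not hasattr(D, 'split')   # tuples have no split method
```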

