
I have an RDD of the form:

(2, [hello, hi, how, are, you])

I need to map this tuple like:

((2, hello), (2, hi), (2, how), (2, are), (2, you))

I am trying this in python:

PairRDD = rdd.flatMap(lambda (k,v): v.split(',')).map(lambda x: (k,x)).reduceByKey())

This will not work because k is not available inside the map transformation. I am not sure how to do it. Any suggestions?

Thanks in advance.


1 Answer


I think your core issue is a misplaced right parenthesis. Consider the following code (I've tested the equivalent in Scala, but it should work the same way in PySpark):

PairRDD = rdd.flatMap(lambda (k,v): v.split(',').map(lambda x: (k,x)))

v is split into a list of strings, that list is mapped to (key, string) tuples, and then the list is returned to flatMap, which splits it out into multiple rows in the RDD. With the extra right parenthesis after v.split(','), you were throwing away the key (since you only returned a list of strings).

Are the key values unique in the original dataset? If so and you want a list of tuples, then instead of flatMap use map and you'll get what you want without a shuffle. If you do want to combine multiple rows from the original dataset, then a groupByKey is called for, not reduceByKey.

I'm also curious if the split is necessary--is your tuple (Int, String) or (Int, List(String))?
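As a minimal runnable sketch of the idea above (assuming an existing SparkContext named sc and a comma-separated string value; note that plain Python lists have no .map method and Python 3 lambdas cannot unpack tuples, so a list comprehension with indexing is used instead):

rdd = sc.parallelize([(2, "hello,hi,how,are,you")])

# Flatten each record into one (key, word) row per word.
pairs = rdd.flatMap(lambda kv: [(kv[0], x) for x in kv[1].split(',')])
print(pairs.collect())
# [(2, 'hello'), (2, 'hi'), (2, 'how'), (2, 'are'), (2, 'you')]

# If the same key appears in several input rows and they should be regrouped,
# groupByKey (not reduceByKey) collects the values per key.
grouped = pairs.groupByKey().mapValues(list)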


5 Comments

Yes, I was doing it wrong. By placing the right parenthesis appropriately I can now get the values as (key, value) pairs. My RDD here is a zipped RDD, and before zipping the right part was a list (I checked its type). Now, after zipping and placing the parenthesis correctly, I can access the key and value, but it gives an error on v.split(',') because the value is a list, not a string. Thanks a lot for the help, Matthew.
@rahul If it's a list you don't need to do split; you should be able to just do v.map(lambda x: (k,x)) as split is a function that turns a string into a list, and map is a function for lists (and other similar objects).
OK, I did that. So my code looks like: PairRDD = rdd.flatMap(lambda (k,v): v.map(lambda x: (k,x))). But it gives me an error that the list object has no attribute map.
Error is: AttributeError: 'list' object has no attribute 'map'.
Sorry, I'm too used to Scala, where map is a method of lists rather than a function that takes a list as an input. In Python I think the cleanest way to do it is a list comprehension, so something like rdd.map(lambda (k,v): [(k, x) for x in v]).
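For the case discussed in these comments, where the value side is already a list, a minimal sketch (again assuming an existing SparkContext sc, and avoiding the Python 2-only tuple-unpacking lambda) might look like:

rdd = sc.parallelize([(2, ["hello", "hi", "how", "are", "you"])])

# flatMap flattens each per-record list into separate (key, word) rows.
PairRDD = rdd.flatMap(lambda kv: [(kv[0], x) for x in kv[1]])
print(PairRDD.collect())
# [(2, 'hello'), (2, 'hi'), (2, 'how'), (2, 'are'), (2, 'you')]

# With map instead of flatMap you would get one row holding the whole list:
# [[(2, 'hello'), (2, 'hi'), (2, 'how'), (2, 'are'), (2, 'you')]]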
