
I'm sure this is something very simple but I didn't find anything related to this.

My code is simple:

... 
stream = stream.map(mapper) 
stream = stream.reduceByKey(reducer) 
... 

Nothing extraordinary. The output looks like this:

... 
key1  value1 
key2  [value2, value3] 
key3  [[value4, value5], value6] 
... 

And so on. Sometimes I get a flat value (if there is only a single one), and sometimes nested lists that can be really, really deep (on my simple test data they were 3 levels deep).

I tried searching through the sources for something like 'flat', but found only the flatMap method, which is (as I understand it) not what I need.

I don't know why those lists are nested. My guess is that they were handled by different processes (workers?) and then joined together without flattening.

Of course, I can write code in Python that will unfold and flatten that list. But I believe this is not a normal situation - I think almost everybody needs a flat output.

itertools.chain stops unfolding at the first non-iterable value it finds. In other words, it still needs some coding (as in the previous paragraph).
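For reference, this is roughly the kind of pure-Python workaround I have in mind - a hypothetical recursive helper that unfolds arbitrarily nested lists (exactly the extra coding I'd like to avoid):

>>> def flatten(value):
...     # recursively unfold arbitrarily nested lists into a single flat list
...     if not isinstance(value, list):
...         return [value]
...     return [leaf for item in value for leaf in flatten(item)]
...
>>> flatten([["value4", "value5"], "value6"])
['value4', 'value5', 'value6']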

So - how to flatten the list using PySpark's native methods?

Thanks

2 Comments

  • What's your reduce function (reducer)? Commented Jan 12, 2014 at 18:45
  • @JoshRosen just "return [key, value]" Commented Jan 13, 2014 at 5:20

2 Answers


The problem here is your reduce function. For each key, reduceByKey calls your reduce function with pairs of values and expects it to produce combined values of the same type.

For example, say that I wanted to perform a word count operation. First, I can map each word to a (word, 1) pair, then I can reduceByKey(lambda x, y: x + y) to sum up the counts for each word. At the end, I'm left with an RDD of (word, count) pairs.

Here's an example from the PySpark API Documentation:

>>> from operator import add
>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> sorted(rdd.reduceByKey(add).collect())
[('a', 2), ('b', 1)]
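Putting the two steps together, a minimal end-to-end word count sketch (assuming a SparkContext named sc and some made-up text lines) could look like this:

>>> lines = sc.parallelize(["spark makes counting easy", "counting words with spark"])
>>> counts = (lines
...     .flatMap(lambda line: line.split())   # split each line into words
...     .map(lambda word: (word, 1))          # pair each word with a count of 1
...     .reduceByKey(lambda x, y: x + y))     # sum the counts per word
>>> sorted(counts.collect())
[('counting', 2), ('easy', 1), ('makes', 1), ('spark', 2), ('with', 1), ('words', 1)]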

To understand why your example didn't work, you can imagine the reduce function being applied something like this:

reduce(reduce(reduce(firstValue, secondValue), thirdValue), fourthValue) ...
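Since your reducer apparently wraps its two arguments in a new list, each application adds another level of nesting. You can reproduce the effect with plain Python's reduce (a sketch, assuming your reduce function is essentially lambda a, b: [a, b]):

>>> from functools import reduce
>>> reducer = lambda a, b: [a, b]   # assumed shape of your reduce function
>>> reduce(reducer, ["value4", "value5", "value6"])
[['value4', 'value5'], 'value6']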

Based on your reduce function, it sounds like you might be trying to implement the built-in groupByKey operation, which groups each key with a list of its values.
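Using the toy RDD from the documentation example above, that would be (just a sketch of the idea):

>>> sorted(rdd.groupByKey().mapValues(list).collect())
[('a', [1, 1]), ('b', [1])]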

Also, take a look at combineByKey, a generalization of reduceByKey() that allows the reduce function's input and output types to differ (reduceByKey is implemented in terms of combineByKey).
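For instance, here's a sketch of using combineByKey to collect values into flat lists (the pairs RDD is just made-up sample data matching your keys):

>>> pairs = sc.parallelize([("key1", "value1"), ("key2", "value2"), ("key2", "value3")])
>>> flat = pairs.combineByKey(
...     lambda v: [v],              # createCombiner: start a new list for the first value
...     lambda acc, v: acc + [v],   # mergeValue: append a value within a partition
...     lambda a, b: a + b)         # mergeCombiners: concatenate partial lists
>>> sorted(flat.collect())
[('key1', ['value1']), ('key2', ['value2', 'value3'])]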


1 Comment

ouch... I must say, Spark's approach differs from many MR frameworks. It takes some time to port some working MRJob or Disco code there.

Alternatively, stream.groupByKey().mapValues(lambda x: list(x)).collect() gives

key1 [value1]
key2 [value2, value3]
key3 [value4, value5, value6]

2 Comments

or just .groupByKey().mapValues(list)
or .reduceByKey(lambda a,b: (a if type(a) == list else [a]) + (b if type(b) == list else [b])).collect()
