
I have an RDD like this:

rdd = sc.parallelize(['a','b','a','c','d','b','e'])

I want to create a map (dictionary) from each unique value to an index.

The output should be a map of (key, value) pairs like:

{'a':0, 'b':1, 'c':2,'d':3,'e':4}

It's super easy to do in plain Python, but I don't know how to do this in Spark.
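
For reference, this is a rough sketch of the plain-Python version, assuming the data fits in a local list:

values = ['a', 'b', 'a', 'c', 'd', 'b', 'e']
index_map = {v: i for i, v in enumerate(sorted(set(values)))}
print(index_map)  # {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}

How do I get the same result when the data lives in an RDD?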


2 Answers


What you are looking for is zipWithIndex.

So for your example (the sortBy is only there so that 'a' maps to 0, and so on):

rdd = sc.parallelize(['a','b','a','c','d','b','e'])

print(rdd.distinct().sortBy(lambda x: x).zipWithIndex().collectAsMap())
# {'a': 0, 'c': 2, 'b': 1, 'e': 4, 'd': 3}
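
If you then need to apply this mapping back to the original RDD, one option (a sketch, not part of the snippet above) is to broadcast the small dictionary and look each value up:

value_to_index = rdd.distinct().sortBy(lambda x: x).zipWithIndex().collectAsMap()
index_map = sc.broadcast(value_to_index)                # ship the small dict to the executors
print(rdd.map(lambda x: index_map.value[x]).collect())  # [0, 1, 0, 2, 3, 1, 4]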




If you can accept gaps, this should do the trick:

rdd.zipWithIndex().reduceByKey(min).collectAsMap()
# {'b': 1, 'c': 3, 'a': 0, 'e': 6, 'd': 4}

Otherwise (much more expensive):

(rdd
    .zipWithIndex()
    .reduceByKey(min)
    .sortBy(lambda x: x[1])
    .keys()
    .zipWithIndex()
    .collectAsMap())
# {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}

2 Comments

This looks fine, but what do you mean by gaps?
The first solution behaves like a sparse rank; the second (which gives the result you seem to expect) behaves more like a dense rank. Note that e is mapped to 6, its index in the original input, not 4, its index in the RDD after dropping duplicates.
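
To make the gaps concrete, this is what zipWithIndex produces on the original input before reduceByKey(min) keeps each value's first position:

print(rdd.zipWithIndex().collect())
# [('a', 0), ('b', 1), ('a', 2), ('c', 3), ('d', 4), ('b', 5), ('e', 6)]
# reduceByKey(min) then keeps a -> 0, b -> 1, c -> 3, d -> 4, e -> 6, hence the gaps.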
