0

I have a dataframe like this where a column have alphanumerical values. Now I want to map all those values to some integer and also have some dictionary where I have those mappings so I can use them later. They're not all unique values.

column
  a
  c1
  3vc
  c1
  .
  .
  .
  n

Output:

column
  1
  2
  3
  2
  .
  .
  .
  n

1 Answer 1

3

If there is no special requirement for the mapping order, you can directly sort the rankings according to the character order to map to integers.

data = [
    ('a',),
    ('c1',),
    ('3vc',),
    ('a',),
    ('c1',),
    ('fd',)
]
df = spark.createDataFrame(data, ['column'])
df = df.selectExpr('column', 'dense_rank() over (order by column) as int_col') \
    .withColumn('mapping', F.expr('map(column, int_col)'))
df.show(truncate=False)

# +------+-------+----------+
# |column|int_col|mapping   |
# +------+-------+----------+
# |3vc   |1      |{3vc -> 1}|
# |a     |2      |{a -> 2}  |
# |a     |2      |{a -> 2}  |
# |c1    |3      |{c1 -> 3} |
# |c1    |3      |{c1 -> 3} |
# |fd    |4      |{fd -> 4} |
# +------+-------+----------+
Sign up to request clarification or add additional context in comments.

2 Comments

Thnaks man, this was really helpful!
Hey, this does change my partition to just 1, and I can repartition later but that's creating problem while saving as rank step has just one partition and it's not dividing the stage into multiple task. Is there any other way I can do it.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.