I am new to PySpark and want to replace names with numbers in a PySpark DataFrame column dynamically, because my DataFrame contains more than 500,000 names. How should I proceed?

----------
| Name   |
----------
| nameone|
----------
| nametwo|
----------

should become

--------
| Name |
--------
|   1  |
--------
|   2  |
--------
  • What have you tried so far and how is it not working? Commented Jul 25, 2019 at 6:40
  • There is a function called row_number() that you can use here. Have a look at it. Commented Jul 25, 2019 at 6:50

1 Answer

I can think of two options. If every name is unique, you can simply apply the monotonically_increasing_id function. It creates a unique, but not consecutive, id for each row.

import pyspark.sql.functions as F
from pyspark.ml.feature import StringIndexer

l = [
('nameone', ),
('nametwo', ),
('nameone', )
]

columns = ['Name']

df = spark.createDataFrame(l, columns)
# Use 'Name' instead of 'uniqueId' to overwrite the column
df = df.withColumn('uniqueId', F.monotonically_increasing_id())
df.show()

Output:

+-------+----------+ 
|   Name|  uniqueId| 
+-------+----------+ 
|nameone|         0| 
|nametwo|8589934592| 
|nameone|8589934593| 
+-------+----------+

If you want rows with the same value for Name to get the same id, you can use a StringIndexer. Note that it returns the index as a double and assigns 0.0 to the most frequent value:

indexer = StringIndexer(inputCol="Name", outputCol="StringIndex")
df = indexer.fit(df).transform(df)
df.show()

Output:

+-------+----------+-----------+
|   Name|  uniqueId|StringIndex|
+-------+----------+-----------+
|nameone|         0|        0.0|
|nametwo|8589934592|        1.0|
|nameone|8589934593|        0.0|
+-------+----------+-----------+