Replacing Strings with numbers in a pyspark dataframe

Question

I am new to PySpark and want to replace names with numbers in a DataFrame column dynamically, because my DataFrame contains more than 500,000 names. How should I proceed?

----------
| Name   |
----------
| nameone|
----------
| nametwo|
----------

should become

--------
| Name |
--------
|   1  |
--------
|   2  |
--------

There is a function called row_number() that you can use here. Take a look at it.
– Prathik Kini, Commented Jul 25, 2019 at 6:50

cronoik · Accepted Answer · 2019-07-25 13:33:03Z

I can think of two options. If all names are unique, you can simply apply the monotonically_increasing_id function. It creates a unique, but not consecutive, id for each row.

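from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.ml.feature import StringIndexer

# Reuses the active session (e.g. in the pyspark shell) or starts a new one
spark = SparkSession.builder.getOrCreate()

l = [
    ('nameone', ),
    ('nametwo', ),
    ('nameone', )
]

columns = ['Name']

df = spark.createDataFrame(l, columns)
# Pass 'Name' instead of 'uniqueId' to overwrite the original column
df = df.withColumn('uniqueId', F.monotonically_increasing_id())
df.show()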

Output:

+-------+----------+ 
|   Name|  uniqueId| 
+-------+----------+ 
|nameone|         0| 
|nametwo|8589934592| 
|nameone|8589934593| 
+-------+----------+
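The large gaps in uniqueId come from how the ids are built. According to the Spark documentation, the current implementation puts the partition index in the upper 31 bits and the record number within each partition in the lower 33 bits. Below is a minimal plain-Python sketch that decodes the ids from the output above. The helper name split_monotonic_id is made up for illustration and is not part of PySpark.

```python
RECORD_BITS = 33  # lower 33 bits hold the record number within a partition


def split_monotonic_id(value):
    """Hypothetical helper: split an id into (partition_index, record_number)."""
    partition_index = value >> RECORD_BITS
    record_number = value & ((1 << RECORD_BITS) - 1)
    return partition_index, record_number


ids = [0, 8589934592, 8589934593]
decoded = [split_monotonic_id(i) for i in ids]
print(decoded)  # [(0, 0), (1, 0), (1, 1)]
```

So 8589934592 is simply 2**33: the first record of the second partition (index 1). The ids are unique and increasing, but they are only consecutive within a single partition.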

If you want rows with the same value for Name to get the same id, use a StringIndexer. It assigns each distinct value an index (by default ordered by descending frequency) and returns it as a double:

indexer = StringIndexer(inputCol="Name", outputCol="StringIndex")
df = indexer.fit(df).transform(df)
df.show()

Output:

+-------+----------+-----------+
|   Name|  uniqueId|StringIndex|
+-------+----------+-----------+
|nameone|         0|        0.0|
|nametwo|8589934592|        1.0|
|nameone|8589934593|        0.0|
+-------+----------+-----------+
