I'm looking for a way to convert a given column of data, in this case strings, and convert them into a numeric representation. For example, I have a dataframe of strings with values:
+------------+
| level |
+------------+
| Medium|
| Medium|
| Medium|
| High|
| Medium|
| Medium|
| Low|
| Low|
| High|
| Low|
| Low|
And I want to create a new column where these values get converted to:
"High"= 1, "Medium" = 2, "Low" = 3
+------------+
| level_num|
+------------+
| 2|
| 2|
| 2|
| 1|
| 2|
| 2|
| 3|
| 3|
| 1|
| 3|
| 3|
I've tried defining a function and doing a foreach over the dataframe like so:
def f(x):
if(x == 'Medium'):
return 2
elif(x == "Low"):
return 3
else:
return 1
a = df.select("level").rdd.foreach(f)
But this returns a "None" type. Thoughts? Thanks for the help as always!