In Pandas, one can do an operation like this:
import pandas as pd

mapping = {
    'a': 'The letter A',
    'b': 'The letter B',
    'c': 'The third letter'
}
x = pd.Series(['a', 'b', 'a', 'c']).map(mapping)
and obtain something like
pd.Series([
    'The letter A',
    'The letter B',
    'The letter A',
    'The third letter'
])
Naively, I can achieve this on a PySpark DataFrame with something like:
import pyspark.sql.functions as F
import pyspark.sql.types as T

mapping = {
    'a': 'The letter A',
    'b': 'The letter B',
    'c': 'The third letter'
}

def _map_values_str(value, mapping=mapping, default=None):
    """Apply a mapping, assuming the result is a string."""
    return mapping.get(value, default)

map_values_str = F.udf(_map_values_str, T.StringType())

data = spark.createDataFrame([('a',), ('b',), ('a',), ('c',)], schema=['letters'])
data = data.withColumn('letters_mapped', map_values_str(F.col('letters')))
But UDFs like this tend to be somewhat slow on large data sets in my experience. Is there a more efficient way?
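One direction I have been considering (I am not sure whether it is the idiomatic one) is to avoid the UDF entirely by turning the dict into a literal map column with F.create_map and indexing into it. Here is a rough sketch of what I mean; mapping_expr is just a name I made up, and the rest reuses mapping and data from above:

from itertools import chain
import pyspark.sql.functions as F

# Build a literal MapType column from the Python dict;
# create_map expects alternating key and value columns.
mapping_expr = F.create_map(*[F.lit(x) for x in chain.from_iterable(mapping.items())])

# Look up each row's letter in the map column; keys that are missing
# from the dict come back as null.
data = data.withColumn('letters_mapped', mapping_expr[F.col('letters')])

My understanding is that this keeps the lookup inside Spark SQL expressions instead of round-tripping every row through Python. Is something along these lines the recommended approach, or is there a better option?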