In Pandas, one can do an operation like this:
import pandas as pd

mapping = {
    'a': 'The letter A',
    'b': 'The letter B',
    'c': 'The third letter'
}
x = pd.Series(['a', 'b', 'a', 'c']).map(mapping)
and obtain something like
pd.Series([
    'The letter A',
    'The letter B',
    'The letter A',
    'The third letter'
])
Naively, I can achieve this on a PySpark DataFrame with something like:
import pyspark.sql.functions as F
import pyspark.sql.types as T

mapping = {
    'a': 'The letter A',
    'b': 'The letter B',
    'c': 'The third letter'
}

def _map_values_str(value, mapping=mapping, default=None):
    """Apply a mapping, assuming the result is a string."""
    return mapping.get(value, default)

map_values_str = F.udf(_map_values_str, T.StringType())

data = spark.createDataFrame([('a',), ('b',), ('a',), ('c',)], schema=['letters'])
data = data.withColumn('letters_mapped', map_values_str(F.col('letters')))
But UDFs like this tend to be somewhat slow on large data sets in my experience. Is there a more efficient way?
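One direction I have been considering (I am not sure whether it is the idiomatic one) is to avoid the UDF entirely by turning the dict into a literal map column with F.create_map and indexing into it. Here is a rough sketch of what I mean; mapping_expr is just a name I made up, and the rest reuses mapping and data from above:

from itertools import chain
import pyspark.sql.functions as F

# Build a literal MapType column from the Python dict;
# create_map expects alternating key and value columns.
mapping_expr = F.create_map(*[F.lit(x) for x in chain.from_iterable(mapping.items())])

# Look up each row's letter in the map column; keys that are missing
# from the dict come back as null.
data = data.withColumn('letters_mapped', mapping_expr[F.col('letters')])

My understanding is that this keeps the lookup inside Spark SQL expressions instead of round-tripping every row through Python. Is something along these lines the recommended approach, or is there a better option?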