
I have a dataframe like

+----+----------+
|id  | device   |
+----+----------+
| 123| phone    |
| 124| phone    |
| 555| phone    |
| 898| tablet   |
| 999| tablet   |
|1111| tv       |
+----+----------+

and I'm looking to add a new column containing the device value with a per-device sequence number appended, like

+----+----------+--------------+
|id  | device   | device_id    |
+----+----------+--------------+
| 123| phone    | phone_00001  |
| 124| phone    | phone_00002  |
| 555| phone    | phone_00003  |
| 898| tablet   | tablet_00001 |
| 999| tablet   | tablet_00002 |
|1111| tv       | tv_00001     |
+----+----------+--------------+

in R it would look like

df %>% group_by(device) %>% mutate(device_id = paste0(device, '_', sprintf("%05d", row_number())))

I'm looking for the same in pyspark.

1 Answer

A similar approach to the R code: assign row numbers within each device partition, and use format_string to build the desired output format:

from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'device_id',
    F.format_string(
        '%s_%05d',  # device name, underscore, zero-padded 5-digit counter
        F.col('device'),
        # number rows within each device group, ordered by id
        F.row_number().over(Window.partitionBy('device').orderBy('id'))
    )
)

df2.show()
+----+------+------------+
|  id|device|   device_id|
+----+------+------------+
| 123| phone| phone_00001|
| 124| phone| phone_00002|
| 555| phone| phone_00003|
|1111|    tv|    tv_00001|
| 898|tablet|tablet_00001|
| 999|tablet|tablet_00002|
+----+------+------------+
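If you want to sanity-check the numbering logic without a Spark session, the same per-group counter can be sketched in plain Python: sort rows by the partition key (device) and the ordering key (id), then number each group's rows from 1, formatting with the same `'%s_%05d'` pattern. The `rows` list below is a hypothetical stand-in for the example dataframe.

```python
from itertools import groupby
from operator import itemgetter

# (id, device) pairs from the example dataframe
rows = [
    (123, 'phone'), (124, 'phone'), (555, 'phone'),
    (898, 'tablet'), (999, 'tablet'), (1111, 'tv'),
]

result = []
# sort by device (partition key), then id (ordering key), and group by device
for device, group in groupby(sorted(rows, key=itemgetter(1, 0)), key=itemgetter(1)):
    for n, (id_, _) in enumerate(group, start=1):
        # same pattern as F.format_string('%s_%05d', device, row_number)
        result.append((id_, device, '%s_%05d' % (device, n)))

for row in result:
    print(row)
# → (123, 'phone', 'phone_00001') ... (1111, 'tv', 'tv_00001')
```

This mirrors what `row_number().over(Window.partitionBy('device').orderBy('id'))` computes, just eagerly on a small in-memory list.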

