pySpark: Concatenating column names into a string into column

Question

I have a pySpark dataframe with many attributes in columns (about 160). These columns hold 1s and 0s to show whether an account has an attribute or not. I need to analyse the combinations of attributes, so I want to build a string in a new column with the names of the attributes that the account has. Here is an example: I have these columns - account, then some other columns, then the attributes. The column I want to add is 'att_list'.

What I have tried is something like this:

I have the list of attributes in a variable

# create a list of all the attributes available
att_names=df1.drop('Account','other_col1','other_col1')
attlist=[x for x in att_names.columns ]

I tried with a function, expanding an existing one:

def func_att_list(df, cols=[]):
    
    att_list_column = ','.join([when(f.col(i) > 0, i) for i in cols])

    return df.withColumn('att_list', att_list_column )

df2 = func_att_list(df1, cols=[i for i in attlist])

This just errors out.

I've also tried this:

att_list_column = [when(df1.col(i) > 0, i) for i in attlist]
df1 = df1.withColumn('att_list', ','.join([i for i in att_list_column ])

This also doesn't work.

I am not confident with functions and find them a bit of a 'black box'. I would greatly appreciate any help.

Thanks for the reply. This doesn't work: it says it needs a column argument. Then I tried f.concat_ws, which gives a "Column is not iterable" error.
– GenDemo, Commented Dec 13, 2022 at 1:27

samkart · Accepted Answer · 2022-12-09 10:35:11Z

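You could use concat_ws and pass it a list of case-when conditions, one per attribute column. Each condition returns the column's name when that column is 1 and null otherwise. concat_ws skips nulls, so only the names of the attributes the row actually has end up in the string.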

Here's a small test example:

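# sample input creation
import random

from pyspark.sql import functions as func

# assumes an active SparkSession named `spark`
data_ls = [
    [random.randint(0, 1) for i in range(10)] for j in range(100)
]

data_sdf = spark.sparkContext.parallelize(data_ls). \
    toDF(['attr'+str(k) for k in range(10)])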

# +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
# |attr0|attr1|attr2|attr3|attr4|attr5|attr6|attr7|attr8|attr9|
# +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
# |    0|    0|    1|    1|    1|    1|    0|    0|    0|    0|
# |    1|    1|    0|    0|    1|    1|    1|    1|    1|    1|
# |    0|    1|    0|    1|    0|    0|    1|    0|    0|    0|
# |    1|    1|    0|    0|    0|    0|    0|    1|    1|    0|
# |    1|    0|    1|    0|    1|    0|    1|    1|    1|    0|
# +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
# only showing top 5 rows

# concatenate one when() per attribute field; when() without otherwise()
# yields null for non-matching rows, and concat_ws skips nulls
data_sdf. \
    withColumn('attr_list', 
               func.concat_ws(',', 
                              *[func.when(func.col(c) == 1, func.lit(c))
                                for c in data_sdf.columns if c.startswith('attr')]
                              )
               ). \
    show(5, truncate=False)

# +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----------------------------------------------+
# |attr0|attr1|attr2|attr3|attr4|attr5|attr6|attr7|attr8|attr9|attr_list                                      |
# +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----------------------------------------------+
# |0    |0    |1    |1    |1    |1    |0    |0    |0    |0    |attr2,attr3,attr4,attr5                        |
# |1    |1    |0    |0    |1    |1    |1    |1    |1    |1    |attr0,attr1,attr4,attr5,attr6,attr7,attr8,attr9|
# |0    |1    |0    |1    |0    |0    |1    |0    |0    |0    |attr1,attr3,attr6                              |
# |1    |1    |0    |0    |0    |0    |0    |1    |1    |0    |attr0,attr1,attr7,attr8                        |
# |1    |0    |1    |0    |1    |0    |1    |1    |1    |0    |attr0,attr2,attr4,attr6,attr7,attr8            |
# +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----------------------------------------------+
# only showing top 5 rows

The list comprehension results in the following list of columns:

[Column<'CASE WHEN (attr0 = 1) THEN attr0 END'>,
 Column<'CASE WHEN (attr1 = 1) THEN attr1 END'>,
 Column<'CASE WHEN (attr2 = 1) THEN attr2 END'>,
 Column<'CASE WHEN (attr3 = 1) THEN attr3 END'>,
 Column<'CASE WHEN (attr4 = 1) THEN attr4 END'>,
 Column<'CASE WHEN (attr5 = 1) THEN attr5 END'>,
 Column<'CASE WHEN (attr6 = 1) THEN attr6 END'>,
 Column<'CASE WHEN (attr7 = 1) THEN attr7 END'>,
 Column<'CASE WHEN (attr8 = 1) THEN attr8 END'>,
 Column<'CASE WHEN (attr9 = 1) THEN attr9 END'>]
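
To apply the same idea to the dataframe in the question, where the attribute columns do not share a prefix, one option is to select them by excluding the non-attribute columns. This is only a minimal sketch: the helper names `attribute_columns` and `add_att_list` are made up for illustration, and the column names in the usage comment are placeholders for whatever the real dataframe contains. It keeps the question's `> 0` condition rather than `== 1`.

```python
def attribute_columns(all_columns, non_attribute_columns):
    """Return the column names that are not excluded, preserving their order."""
    excluded = set(non_attribute_columns)
    return [c for c in all_columns if c not in excluded]


def add_att_list(df, non_attribute_columns, out_col="att_list"):
    """Add a comma-separated string column naming the attribute columns that are > 0."""
    # Imported here so attribute_columns() stays usable without Spark installed.
    from pyspark.sql import functions as F

    att_cols = attribute_columns(df.columns, non_attribute_columns)
    # when() without otherwise() yields null when the condition is false,
    # and concat_ws() skips nulls, so only the matching names are joined.
    flags = [F.when(F.col(c) > 0, F.lit(c)) for c in att_cols]
    # The * unpacks the list so each Column is passed as its own argument.
    return df.withColumn(out_col, F.concat_ws(",", *flags))


# Hypothetical usage with placeholder column names:
# df2 = add_att_list(df1, ["Account", "other_col1"])
```

Passing the list with `*` matters here, since concat_ws takes the columns as separate arguments rather than as a single list.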
