I have a PySpark dataframe with many attributes in columns (there are about 160). These columns hold 1s and 0s showing whether an account has an attribute or not. I need to analyse combinations of attributes, so I want to build a string in a new column containing the names of the attributes that each account has. Here is an example: I have these columns: account, then some other columns, then the attributes. The column I want to add is 'att_list'.

What I have tried is something like this:

I have the list of attributes in a variable

# create a list of all the attribute columns (everything except the non-attribute ones)
att_names = df1.drop('Account', 'other_col1', 'other_col2')
attlist = att_names.columns

I tried with a function, adapting an existing one:

def func_att_list(df, cols=[]):
    
    att_list_column = ','.join([when(f.col(i) > 0, i) for i in cols])

    return df.withColumn('att_list', att_list_column )

df2 = func_att_list(df1, cols=[i for i in attlist])

This just errors out.

I've also tried this:

att_list_column = [when(df1.col(i) > 0, i) for i in attlist]
df1 = df1.withColumn('att_list', ','.join([i for i in att_list_column]))

This also doesn't work.

I am not confident with functions and find them a bit of a 'black box'. I would greatly appreciate any help.

  • Try F.concat instead of join. Commented Dec 9, 2022 at 2:06
  • Thanks for the reply. This doesn't work: it says it needs a column argument. Then I tried f.concat_ws, which gives a "Column is not iterable" error. Commented Dec 13, 2022 at 1:27

1 Answer

You could use concat_ws and pass it a list of case-when conditions, one per attribute column. Each condition says: if the attribute column is 1, return the attribute column's name.

Here's a small test example:

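import random
from pyspark.sql import functions as func

# sample input creation (assumes an active SparkSession named `spark`)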
data_ls = [
    [random.randint(0, 1) for i in range(10)] for j in range(100)
]

data_sdf = spark.sparkContext.parallelize(data_ls). \
    toDF(['attr'+str(k) for k in range(10)])

# +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
# |attr0|attr1|attr2|attr3|attr4|attr5|attr6|attr7|attr8|attr9|
# +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
# |    0|    0|    1|    1|    1|    1|    0|    0|    0|    0|
# |    1|    1|    0|    0|    1|    1|    1|    1|    1|    1|
# |    0|    1|    0|    1|    0|    0|    1|    0|    0|    0|
# |    1|    1|    0|    0|    0|    0|    0|    1|    1|    0|
# |    1|    0|    1|    0|    1|    0|    1|    1|    1|    0|
# +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
# only showing top 5 rows

# concatenate one when() per attribute field; with no otherwise(), a
# non-matching when() is null, and concat_ws skips nulls
data_sdf. \
    withColumn('attr_list', 
               func.concat_ws(',', 
                              *[func.when(func.col(c) == 1, func.lit(c))
                                for c in data_sdf.columns if c.startswith('attr')]
                              )
               ). \
    show(5, truncate=False)

# +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----------------------------------------------+
# |attr0|attr1|attr2|attr3|attr4|attr5|attr6|attr7|attr8|attr9|attr_list                                      |
# +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----------------------------------------------+
# |0    |0    |1    |1    |1    |1    |0    |0    |0    |0    |attr2,attr3,attr4,attr5                        |
# |1    |1    |0    |0    |1    |1    |1    |1    |1    |1    |attr0,attr1,attr4,attr5,attr6,attr7,attr8,attr9|
# |0    |1    |0    |1    |0    |0    |1    |0    |0    |0    |attr1,attr3,attr6                              |
# |1    |1    |0    |0    |0    |0    |0    |1    |1    |0    |attr0,attr1,attr7,attr8                        |
# |1    |0    |1    |0    |1    |0    |1    |1    |1    |0    |attr0,attr2,attr4,attr6,attr7,attr8            |
# +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----------------------------------------------+
# only showing top 5 rows
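
To see why the first row comes out as attr2,attr3,attr4,attr5, the per-row logic can be modelled in plain Python without Spark. This is only an illustrative sketch: case_when and concat_ws below are hypothetical stand-ins written for this example, not the PySpark functions, and None plays the role of Spark's null.

```python
from typing import Optional


def case_when(value: int, name: str) -> Optional[str]:
    """Stand-in for when(col == 1, lit(name)): the name if the flag is 1, else None."""
    return name if value == 1 else None


def concat_ws(sep: str, *parts: Optional[str]) -> str:
    """Stand-in for concat_ws: join the parts, skipping None (Spark's null)."""
    return sep.join(p for p in parts if p is not None)


# First row of the sample output above (dicts keep insertion order).
row = {"attr0": 0, "attr1": 0, "attr2": 1, "attr3": 1, "attr4": 1,
       "attr5": 1, "attr6": 0, "attr7": 0, "attr8": 0, "attr9": 0}

# One case_when per column, unpacked into concat_ws like the *[...] in the answer.
attr_list = concat_ws(",", *(case_when(v, k) for k, v in row.items()))
print(attr_list)  # attr2,attr3,attr4,attr5
```

The unpacking with * matters: concat_ws takes the separator followed by separate column arguments.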

The list comprehension passed to concat_ws results in the following Column expressions:

[Column<'CASE WHEN (attr0 = 1) THEN attr0 END'>,
 Column<'CASE WHEN (attr1 = 1) THEN attr1 END'>,
 Column<'CASE WHEN (attr2 = 1) THEN attr2 END'>,
 Column<'CASE WHEN (attr3 = 1) THEN attr3 END'>,
 Column<'CASE WHEN (attr4 = 1) THEN attr4 END'>,
 Column<'CASE WHEN (attr5 = 1) THEN attr5 END'>,
 Column<'CASE WHEN (attr6 = 1) THEN attr6 END'>,
 Column<'CASE WHEN (attr7 = 1) THEN attr7 END'>,
 Column<'CASE WHEN (attr8 = 1) THEN attr8 END'>,
 Column<'CASE WHEN (attr9 = 1) THEN attr9 END'>]