I have a PySpark DataFrame in which several columns contain arrays of different lengths. I want to iterate through the relevant columns and clip the array in each row so that they are all the same length; in this example, a length of 3.
This is an example dataframe:
id_1|id_2|id_3|timestamp             |thing1           |thing2       |thing3
A   |b   |c   |[time_0,time_1,time_2]|[1.2,1.1,2.2]    |[1.3,1.5,2.6]|[2.5,3.4,2.9]
A   |b   |d   |[time_0,time_1]       |[5.1,6.1,1.4,1.6]|[5.5,6.2,0.2]|[5.7,6.3]
A   |b   |e   |[time_0,time_1]       |[0.1,0.2,1.1]    |[0.5,0.3,0.3]|[0.9,0.6,0.9,0.4]
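For context, a minimal sketch that builds this example DataFrame (it assumes a running SparkSession; the name example matches the code below):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rows and column names taken from the table above.
example = spark.createDataFrame(
    [
        ("A", "b", "c", ["time_0", "time_1", "time_2"],
         [1.2, 1.1, 2.2], [1.3, 1.5, 2.6], [2.5, 3.4, 2.9]),
        ("A", "b", "d", ["time_0", "time_1"],
         [5.1, 6.1, 1.4, 1.6], [5.5, 6.2, 0.2], [5.7, 6.3]),
        ("A", "b", "e", ["time_0", "time_1"],
         [0.1, 0.2, 1.1], [0.5, 0.3, 0.3], [0.9, 0.6, 0.9, 0.4]),
    ],
    ["id_1", "id_2", "id_3", "timestamp", "thing1", "thing2", "thing3"],
)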
So far I have:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

def clip_func(x, ts_len, backfill=1500):
    # Build a template of backfill values, overlay x onto its tail,
    # and keep only the last ts_len entries.
    template = [backfill] * ts_len
    template[-len(x):] = x
    x = template
    return x[-1 * ts_len:]

clip = udf(clip_func, ArrayType(DoubleType()))

missing_fill = 3.3
for c in [x for x in example.columns if 'thing' in x]:
    example = example.withColumn(c, clip(c, 3, missing_fill))
But it is not working. If an array is too short, I want to pad it with the missing_fill value.
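For reference, this is the padding/clipping behaviour I am after, shown on plain Python lists (a slightly condensed, Spark-free version of clip_func above; sample values taken from the table):

def clip_func(x, ts_len, backfill=1500):
    # Overlay x onto the tail of a backfill template, then keep the last ts_len values.
    template = [backfill] * ts_len
    template[-len(x):] = x
    return template[-ts_len:]

clip_func([5.7, 6.3], 3, backfill=3.3)   # -> [3.3, 5.7, 6.3]  (too short: front-padded)
clip_func([0.9, 0.6, 0.9, 0.4], 3)       # -> [0.6, 0.9, 0.4]  (too long: clipped to last 3)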
Comments:

Have you tried clip = udf(clip_func, DoubleType())? The example in the docs uses IntegerType, not ArrayType, so that would be my only suggestion on what looks wrong here.

Just to confirm: you want the thing arrays to be the same length as the timestamp arrays. Is that correct?

Answer:

You are passing plain Python values to the udf when you should be passing column literals (pyspark.sql.functions.lit):

df = df.withColumn(c, clip(col(c), lit(3), lit(missing_fill)))

At least that way you won't get the error.
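Putting the fix together, here is a self-contained sketch that reuses the asker's clip_func and the example DataFrame from above (the length 3 and missing_fill = 3.3 are the values from the question):

from pyspark.sql.functions import col, lit, udf
from pyspark.sql.types import ArrayType, DoubleType

clip = udf(clip_func, ArrayType(DoubleType()))

missing_fill = 3.3
for c in [x for x in example.columns if 'thing' in x]:
    # Wrap the non-column arguments in lit() so Spark passes them to the
    # UDF as column literals instead of trying to resolve them as columns.
    example = example.withColumn(c, clip(col(c), lit(3), lit(missing_fill)))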