
I am looking for help with a PySpark dataframe that has a column such as this one:

Value
id1:xxx1xxx
id2:666x666
id1:xxx4xxx||id2:555x555||id1:xxx5xxx

What I want to create is an additional column in which these values are stored in a struct array. The purpose of this is to match the values with another dataframe. I hope this question makes sense.

I have been able to convert the dataframe column into the following:

["id1:xxx7xxx", "id2:777l777", "id1:999xx99"]

Any suggestions on how to convert this into a structured array?

Thanks!

  • Can you share your intended format? Also, in the 3rd row I see that "id1" is present twice. Is that on purpose? Commented Oct 5, 2021 at 15:06
  • Thanks for your help. What I wanted was a structured array with labelled elements. So basically all the ID types listed in the structured array. Something like this: [{"id1":["fff6x666","555k999"], "id2":6666, "id3":["kkk7kkk","666k999"]}] Commented Oct 5, 2021 at 16:06
  • hope above makes a bit more sense now? Commented Oct 5, 2021 at 16:10

1 Answer


The code below produces an output similar to the one you specified, but using maps, since the names needed to be dynamic. Please note that if you will only ever have a small, finite number of IDs that you know beforehand (id1, id2, id3), then I might approach this slightly differently. Also note the output is slightly different from what you specified: if there is only one ID, you will get a list with one item. I am not sure it is possible to have it exactly the way you specified, as you would be asking for two different "types" as values (a list if there is more than one value, a string if there is only one), which would cause problems anyway.

I could have done this in fewer steps, but wanted to show you the thought process and walk you through it.

from pyspark.sql import functions as F

df = spark.createDataFrame([("id1:xxx7xxx",), ("id2:777l777",), ("id1:xxx4xxx||id2:555x555||id1:xxx5xxx",)], ["Value"])

# Split the string on "||" and then explode so each id:value token gets its own row.
# We keep Value because we will group by it later to get the split values back onto
# their original rows; if you have other columns you can group by, you don't need to keep it.
df = df.select("Value", F.explode(F.split(F.col("Value"), r"\|\|")).alias("split_val"))

df = df.withColumn("id_num", F.split(F.col("split_val"), ":").getItem(0)) \
.withColumn("id_val", F.split(F.col("split_val"), ":").getItem(1))

df = df.groupBy(["Value", "id_num"]).agg(F.collect_list("id_val").alias("id_val_list")) \
.withColumn("idMap", F.create_map(F.col("id_num"), F.col("id_val_list")))

# now group by original value to get this back in one row per Value
df = df.groupBy(["Value"]).agg(F.collect_list("idMap").alias("ValueList"))

# If you don't want Value anymore, you can just select ValueList and rename it to Value
df = df.select(F.col("ValueList").alias("Value"))
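Since your stated goal is matching against another dataframe, one way to use this result is to flatten it back out and join. This is just a sketch; the lookup dataframe and its column names below are made up for illustration:

# Hypothetical lookup table to match against; the names are assumptions
lookup = spark.createDataFrame([("id2", "555x555")], ["id_num", "id_val"])

# Flatten: array of maps -> (key, value list) pairs -> individual values
flat = df.select(F.explode("Value").alias("idMap")) \
    .select(F.explode("idMap").alias("id_num", "id_val_list")) \
    .select("id_num", F.explode("id_val_list").alias("id_val"))

matches = flat.join(lookup, ["id_num", "id_val"], "inner")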

3 Comments

I will study your suggestion tomorrow when I have a bit more time to work on this, but at first sight this looks helpful. Thanks!
Thank you. If it satisfies your requirements, please accept the answer at your convenience.
Yes, it works. I had used the first step of your suggestion as my workaround for this, but the latter part makes the dataframe more accurate and clean. Thanks!
