
I am looking for help with a PySpark dataframe that has a column such as this one:

Value
id1:xxx1xxx
id2:666x666
id1:xxx4xxx||id2:555x555||id1:xxx5xxx

What I want to create is an additional column in which these values are stored in a struct array. The purpose of this is to match the values with another dataframe. I hope this question makes sense.

I have been able to convert the dataframe column into the following:

["id1:xxx7xxx", "id2:777l777", "id1:999xx99"]

Any suggestions on how to convert this into a structured array?

Thanks!

  • Can you share your intended format? Also, in the 3rd row I see that "id1" is present twice. Is that on purpose? Commented Oct 5, 2021 at 15:06
  • Thanks for your help. What I wanted was a structured array with labelled elements. So basically all the ID types listed in the structured array. Something like this: [{"id1":["fff6x666","555k999"], "id2":6666, "id3":["kkk7kkk","666k999"]}] Commented Oct 5, 2021 at 16:06
  • hope above makes a bit more sense now? Commented Oct 5, 2021 at 16:10

1 Answer


The code below produces an output similar to the one you specified, but using maps, since the names needed to be dynamic. Please note that if you will only ever have a small, finite number of IDs that you know beforehand (id1, id2, id3), then I might approach this slightly differently. Also note the output is slightly different from what you specified: if there is only one ID, you will get a list with one item. I am not sure it is possible to have it exactly the way you specified, as you would be asking for two different "types" as values (a list if there is more than one value, a string if there is only one), which would cause problems anyway.

I could have done this in fewer steps, but wanted to show you the thought process and walk you through it.

from pyspark.sql import functions as F

df = spark.createDataFrame([("id1:xxx7xxx",), ("id2:777l777",), ("id1:xxx4xxx||id2:555x555||id1:xxx5xxx",)], ["Value"])

# Split the string on "||" and then explode so each id:value token gets its own row.
# We keep Value because we will group by it later to get the split values back onto
# their original rows; if you have other columns you can group by, you don't need to keep it.
df = df.select("Value", F.explode(F.split(F.col("Value"), r"\|\|")).alias("split_val"))

df = df.withColumn("id_num", F.split(F.col("split_val"), ":").getItem(0)) \
.withColumn("id_val", F.split(F.col("split_val"), ":").getItem(1))

df = df.groupBy(["Value", "id_num"]).agg(F.collect_list("id_val").alias("id_val_list")) \
.withColumn("idMap", F.create_map(F.col("id_num"), F.col("id_val_list")))

# now group by original value to get this back in one row per Value
df = df.groupBy(["Value"]).agg(F.collect_list("idMap").alias("ValueList"))

# If you don't want Value anymore, you can just select ValueList and rename it to Value
df = df.select(F.col("ValueList").alias("Value"))
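Since your stated goal is matching against another dataframe, one way to use this result is to flatten it back out and join. This is just a sketch; the lookup dataframe and its column names below are made up for illustration:

# Hypothetical lookup table to match against; the names are assumptions
lookup = spark.createDataFrame([("id2", "555x555")], ["id_num", "id_val"])

# Flatten: array of maps -> (key, value list) pairs -> individual values
flat = df.select(F.explode("Value").alias("idMap")) \
    .select(F.explode("idMap").alias("id_num", "id_val_list")) \
    .select("id_num", F.explode("id_val_list").alias("id_val"))

matches = flat.join(lookup, ["id_num", "id_val"], "inner")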

3 Comments

I will study your suggestion tomorrow when I have a bit more time to work on this, but at first sight this looks helpful. Thanks!
Thank you. If it satisfies your requirements, please accept the answer at your convenience.
Yes, it works. I had used the first step of your suggestion as my workaround for this, but the latter part makes the dataframe more accurate and clean. Thanks!
