
In my Spark application I have a DataFrame with information like:

+------------------+---------------+
|      labels      | labels_values |
+------------------+---------------+
| ['l1','l2','l3'] |           000 |
| ['l3','l4','l5'] |           100 |
+------------------+---------------+

What I am trying to achieve is, given a label name as input, to create a single_label_value column that takes the value for that label from the labels_values column.

For example, for label='l3' I would like to retrieve this output:

+------------------+---------------+--------------------+
|      labels      | labels_values | single_label_value |
+------------------+---------------+--------------------+
| ['l1','l2','l3'] |           000 |                  0 |
| ['l3','l4','l5'] |           100 |                  1 |
+------------------+---------------+--------------------+

Here's what I am attempting to use:

import pyspark.sql.functions as F

selected_label = 'l3'
label_position = F.array_position(my_df.labels, selected_label)
my_df = my_df.withColumn(
    "single_label_value",
    F.substring(my_df.labels_values, label_position, 1)
)

But I am getting an error, because the substring function expects its pos argument to be a plain integer, not a Column like label_position.

Is there any way to combine these function outputs without writing a UDF?

  • Can you explain the single_label_value column: how is the 0 value obtained in the first row? What exactly are you trying to do with substring here? Commented Mar 18, 2020 at 12:43
  • @ManuGupta given a column containing an array and a column containing a string, I want to check the position of an element in the array and use it to extract the nth character from the string. Commented Mar 18, 2020 at 12:48
  • Understood. The values you mentioned were not exactly giving this location idea, since in the first row the (0-based) index of 'l3' is 2 and in the second row it is 0, while the values you are showing are 0 and 1. Let me work on a solution. Commented Mar 18, 2020 at 12:55
  • Use the SQL syntax of the function (see the sketch just below); check this FAQ: stackoverflow.com/questions/51140470/… Commented Mar 18, 2020 at 13:56
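
(For reference, a minimal sketch of the SQL-syntax approach the last comment hints at, assuming the column names from the question and Spark >= 2.4; expr() parses a SQL expression, and inside SQL, substring accepts a column-valued position:)

import pyspark.sql.functions as F

selected_label = 'l3'

# array_position and SQL substring are both 1-based, so no offset is needed.
# Note the single quotes around the label inside the SQL string.
my_df = my_df.withColumn(
    "single_label_value",
    F.expr(f"substring(labels_values, array_position(labels, '{selected_label}'), 1)")
)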

2 Answers


Hope this works for you.

from pyspark.sql import SparkSession
from pyspark.sql.functions import *

spark = SparkSession.builder.getOrCreate()
mydata = [[['l1','l2','l3'], '000'], [['l3','l4','l5'], '100']]

df = spark.createDataFrame(mydata, schema=["labels", "labels_values"])

selected_label = 'l3'

# array_position is 1-based, which matches SQL substring's 1-based pos,
# so the position can be used directly without any offset.
df2 = df.select(
        "*",
        array_position(df.labels, selected_label).alias("pos_val"))

df2.createOrReplaceTempView("temp_table")

df3 = spark.sql("select *, substring(labels_values, pos_val, 1) as val_pos from temp_table")

df3.show()


+------------+-------------+-------+-------+
|      labels|labels_values|pos_val|val_pos|
+------------+-------------+-------+-------+
|[l1, l2, l3]|          000|      3|      0|
|[l3, l4, l5]|          100|      1|      1|
+------------+-------------+-------+-------+

The pos_val column here is the 1-based position of the label, which is exactly what substring expects, so it can be used directly; subtract 1 from it only if you need a 0-based index for something else.
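
(A quick sketch to double-check the indexing in the same session:)

spark.sql("SELECT array_position(array('l1','l2','l3'), 'l3') AS pos").show()
# pos comes back as 3, not 2: array_position is 1-based, just like SQL
# substring, so the two line up without any offset.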

Edited answer: worked with a temp view; still looking for a solution using the withColumn option. I hope it will help you for now. Edit 2: answer using the DataFrame API.

df2 = df.select(
        "*",
        array_position(df.labels, selected_label).astype("int").alias("pos_val")
        )

df3 = df2.withColumn("asked_col", expr("substring(labels_values, pos_val, 1)"))
df3.show()

3 Comments

But then how can I use that label_position to get 0 or 1?
I hope I understood the requirement now: the character at that position of labels_values is needed. Let me change it.
I hope that concludes the answer. Please check.

Try maybe:

import pyspark.sql.functions as f

selected_label = 'l3'

# array_position and substring are both 1-based, so the position can be
# passed straight through.
df = df.withColumn(
    'single_label_value',
    f.substring(
        f.col('labels_values'),
        f.array_position(f.col('labels'), f.lit(selected_label)),
        1
    )
)

df.show()

(for Spark version >= 2.4)

I think lit() was the function you were missing: you can use it to pass constant values into Spark DataFrame expressions.
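
(A minimal illustration of lit(), assuming the df from the question:)

import pyspark.sql.functions as f

# lit() wraps a plain Python value into a Column expression, so it can be
# combined with other Column expressions.
df.withColumn('selected_label', f.lit('l3')).show()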
