
I have the following dataframe:

+-------------------+------------------------+
|value              |feeling                 |
+-------------------+------------------------+
|Sam got these marks|[sad, sad, disappointed]|
|I got good marks   |[happy, excited, happy] |
+-------------------+------------------------+

I want to iterate through this dataframe, get the array in the feeling column for each row, and pass that array into a calculation method.

def calculationMethod(arrayValue: Array[String]) = {
  // get average of words
}

Expected output dataframe:

    +-------------------+------------------------+-------+
    |value              |feeling                 |average|
    +-------------------+------------------------+-------+
    |Sam got these marks|[sad, sad, disappointed]|sad    |
    |I got good marks   |[happy, excited, happy] |happy  |
    +-------------------+------------------------+-------+

I am not sure how to iterate through each row and get the array in the second column so it can be passed into the method I have written. Please also note that the dataframe shown in the question is a streaming dataframe.
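Judging by the expected output above, the "average" is the most frequent word in the array. As a rough plain-Scala sketch of that calculation (`mostFrequentWord` is just a name I made up):

```scala
// Rough sketch: pick the most frequent word in the array
// (the "average" column in the expected output). Ties are broken arbitrarily.
def mostFrequentWord(words: Seq[String]): String =
  words.groupBy(identity).maxBy(_._2.size)._1

println(mostFrequentWord(Seq("sad", "sad", "disappointed")))
```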

EDIT 1

val calculateUDF = udf(calculationMethod _)
val editedDataFrame = filteredDataFrame.withColumn("average", calculateUDF(col("feeling")))

def calculationMethod(emojiArray: Seq[String]): DataFrame = {
  val existingSparkSession = SparkSession.builder().getOrCreate()
  import existingSparkSession.implicits._
  val df = emojiArray.toDF("feeling")
  val result = df.selectExpr(
    "feeling",
    "'U+' || trim('0', string(hex(encode(feeling, 'utf-32')))) as unicode"
  )
  result
}

I'm getting the following error:

Schema for type org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] is not supported

Please note that the initial dataframe mentioned in the question is a streaming dataframe.

EDIT 2

This is the final dataframe that I am expecting:

    +-------------------+--------------+-------------------------+
    |value              |feeling       |unicode                  |
    +-------------------+--------------+-------------------------+
    |Sam got these marks|[😀😆😁]      |[U+1F600 U+1F606 U+1F601]|
    |I got good marks   |[😄🙃]        |[U+1F604 U+1F643]        |
    +-------------------+--------------+-------------------------+

1 Answer

You can transform the arrays instead of using a UDF:

val df2 = df.withColumn(
    "unicode", 
    expr("transform(feeling, x -> 'U+' || ltrim('0' , string(hex(encode(x, 'utf-32')))))")
)

df2.show(false)
+-------------------+------------+---------------------------+
|value              |feeling     |unicode                    |
+-------------------+------------+---------------------------+
|Sam got these marks|[😀, 😆, 😁]|[U+1F600, U+1F606, U+1F601]|
|I got good marks   |[😄, 🙃]    |[U+1F604, U+1F643]         |
+-------------------+------------+---------------------------+
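
For reference, the per-element conversion that `transform` applies can be sketched in plain Scala. This is only a sketch, assuming Spark's `encode(x, 'utf-32')` produces big-endian UTF-32 bytes (which matches the JVM's `UTF-32` charset); `toUnicodeNotation` is a hypothetical helper name:

```scala
// Sketch of the per-element logic inside transform:
// hex-encode the UTF-32 bytes of the emoji, strip leading zeros, prefix "U+".
def toUnicodeNotation(emoji: String): String = {
  val hex = emoji.getBytes("UTF-32").map(b => f"$b%02X").mkString
  "U+" + hex.dropWhile(_ == '0')
}

println(toUnicodeNotation("\uD83D\uDE00")) // 😀
```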

8 Comments

I'm getting the error "Exception in thread "main" java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] is not supported"
@tharindu it's not possible to call methods on a Spark dataframe inside a UDF. You can consider using transform on the array. If you could provide an example dataframe I can give it a try.
I have provided the expected dataframe under edit 2.
I am not getting any output. No error is thrown, but nothing is printed to the console.
Sorry, that was caused by a network issue. It's working properly. Thanks a lot for the help!
