
I want to convert an array-of-String column in a dataframe to a String, using a different delimiter than a comma and removing the array brackets. I want the "," to be replaced with ";#". This is to avoid elements that may contain "," themselves, as it is a freeform text field. I am using Spark 1.6.

Examples below:

Schema:

root
 |-- carLineName: array (nullable = true)
 |    |-- element: string (containsNull = true)

Input as Dataframe:

+--------------------+
|carLineName         |
+--------------------+
|[Avalon,CRV,Camry]  |
|[Model T, Model S]  |
|[Cayenne, Mustang]  |
|[Pilot, Jeep]       |
+--------------------+

Desired output:

+--------------------+
|carLineName         |
+--------------------+
|Avalon;#CRV;#Camry  |
|Model T;#Model S    |
|Cayenne;#Mustang    |
|Pilot;#Jeep         |
+--------------------+

Current code which produces the input above:

val newCarDf = carDf.select(col("carLineName").cast("String").as("carLineName"))
  • Can you give us the code that you have currently? Commented Sep 11, 2019 at 15:31
  • Provided the line I used to get the input, converting the array type to String. I can easily parse the commas out of the dataframe column as a String, but the reason for changing the delimiter is to prevent free-form text containing commas from breaking the parsing, so I am looking for a way to map the array of Strings into a delimited String. Commented Sep 11, 2019 at 15:37
  • So carLineName is an array? Commented Sep 11, 2019 at 15:55
  • Yes an Array of Strings (it is being read in as a dataframe) Commented Sep 11, 2019 at 16:15

4 Answers


You can use the native function array_join (available since Spark 2.4):

import org.apache.spark.sql.functions.array_join
import spark.implicits._ // for toDF and the $ column syntax

val l = Seq(Seq("Avalon","CRV","Camry"), Seq("Model T", "Model S"), Seq("Cayenne", "Mustang"), Seq("Pilot", "Jeep"))
val df = l.toDF("carLineName")

df.withColumn("str", array_join($"carLineName", ";#")).show()

+--------------------+------------------+
|         carLineName|               str|
+--------------------+------------------+
|[Avalon, CRV, Camry]|Avalon;#CRV;#Camry|
|  [Model T, Model S]|  Model T;#Model S|
|  [Cayenne, Mustang]|  Cayenne;#Mustang|
|       [Pilot, Jeep]|       Pilot;#Jeep|
+--------------------+------------------+
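
If the arrays may contain null elements, array_join also has a three-argument overload, array_join(column, delimiter, nullReplacement), likewise available since Spark 2.4, that substitutes a placeholder instead of skipping nulls. A minimal sketch reusing df from above; the "N/A" placeholder is an arbitrary choice:

import org.apache.spark.sql.functions.array_join

// Null elements become "N/A" rather than being silently skipped.
df.withColumn("str", array_join($"carLineName", ";#", "N/A")).show()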

1 Comment

Wow this would have been nice. Unfortunately, still on Spark 1.6.

You can create a user-defined function that concatenates elements with a ";#" separator, as in the following example:

import org.apache.spark.sql.functions.{col, udf}

val df1 = Seq(
  ("1", Array("t1", "t2")),
  ("2", Array("t1", "t3", "t5"))
).toDF("id", "arr")

def formatString: Seq[String] => String = x => x.reduce(_ ++ ";#" ++ _)
def udfFormat = udf(formatString)
df1.withColumn("formatedColumn", udfFormat(col("arr")))


+---+------------+--------------+
| id|         arr|formatedColumn|
+---+------------+--------------+
|  1|    [t1, t2]|        t1;#t2|
|  2|[t1, t3, t5]|    t1;#t3;#t5|
+---+------------+--------------+

3 Comments

If I want to do this for multiple columns, what do you recommend?
You apply a foldLeft over the list of columns, with your dataframe as the initial (zero) element: columnsList.foldLeft(df){ case (data, colName) => data.withColumn(colName ++ "formated", udfFormat(col(colName))) }
If I have a scenario like [null, null, null], do you have any recommendations on how to handle that?
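
A minimal sketch addressing both comments above, using mkString instead of reduce so that null, empty, or all-null arrays yield an empty string rather than throwing; columnsList is a hypothetical list naming the array-typed columns to convert:

import org.apache.spark.sql.functions.{col, udf}

// Null-safe variant: drop null elements before joining; a null or
// all-null array produces "".
def udfSafeFormat = udf((xs: Seq[String]) =>
  Option(xs).getOrElse(Seq.empty).filter(_ != null).mkString(";#"))

val columnsList = Seq("arr") // hypothetical: the array columns to convert
val formatted = columnsList.foldLeft(df1) { case (data, colName) =>
  data.withColumn(colName ++ "formated", udfSafeFormat(col(colName)))
}
formatted.show()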

You could simply write a user-defined function (udf), which takes an Array of String as its input parameter. Inside the udf, any operation can be performed on the array.

import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
def toCustomString: UserDefinedFunction = udf((carLineName: Seq[String]) => {
  carLineName.mkString(";#")
})
val newCarDf = df.withColumn("carLineName", toCustomString(df.col("carLineName")))

This udf can be made more generic by passing the delimiter as a second parameter.

import org.apache.spark.sql.functions.lit
def toCustomStringWithDelimiter: UserDefinedFunction = udf((carLineName: Seq[String], delimiter: String) => {
  carLineName.mkString(delimiter)
})
val newCarDf = df.withColumn("carLineName", toCustomStringWithDelimiter(df.col("carLineName"), lit(";#")))



Since you are using Spark 1.6, we can do a simple map of Row to WrappedArray.

Here is how it goes.

Input:

scala> val carLineDf = Seq( (Array("Avalon","CRV","Camry")),
     |                      (Array("Model T", "Model S")),
     |                      (Array("Cayenne", "Mustang")),
     |                      (Array("Pilot", "Jeep"))
     |                    ).toDF("carLineName")
carLineDf: org.apache.spark.sql.DataFrame = [carLineName: array<string>]

Schema:

scala> carLineDf.printSchema
root
 |-- carLineName: array (nullable = true)
 |    |-- element: string (containsNull = true)

Then we just use Row.getAs to get a WrappedArray of String instead of a Row object, and we can manipulate it with the usual Scala built-ins:

scala> import scala.collection.mutable.WrappedArray
import scala.collection.mutable.WrappedArray

scala> carLineDf.map( row => row.getAs[WrappedArray[String]](0)).map( a => a.mkString(";#")).toDF("carLineNameAsString").show(false)
+-------------------+
|carLineNameAsString|
+-------------------+
|Avalon;#CRV;#Camry |
|Model T;#Model S   |
|Cayenne;#Mustang   |
|Pilot;#Jeep        |
+-------------------+

// An even easier alternative (toDF so that show works on the result)
carLineDf.map( row => row.getAs[WrappedArray[String]](0)).map( r => r.reduce(_ + ";#" + _)).toDF("carLineNameAsString").show(false)

That's it. Depending on your Spark version, you might have to map over dataframe.rdd instead; otherwise this should do.
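
For completeness, a minimal sketch of the dataframe.rdd route mentioned above; this also works on Spark 2.x, where DataFrame.map would require an Encoder (toDF assumes the shell's implicits are in scope):

import scala.collection.mutable.WrappedArray

// Map over the underlying RDD of Rows, join each array, and
// convert the resulting RDD[String] back to a DataFrame.
carLineDf.rdd
  .map(row => row.getAs[WrappedArray[String]](0).mkString(";#"))
  .toDF("carLineNameAsString")
  .show(false)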

