
I want to convert an array-of-String column in a dataframe to a String, using a different delimiter than a comma and removing the array brackets. I want the "," to be replaced with ";#". This is to avoid elements that may contain "," themselves, as it is a freeform text field. I am using Spark 1.6.

Examples below:

Schema:

root
 |-- carLineName: array (nullable = true)
 |    |-- element: string (containsNull = true)

Input as Dataframe:

+--------------------+
|carLineName         |
+--------------------+
|[Avalon,CRV,Camry]  |
|[Model T, Model S]  |
|[Cayenne, Mustang]  |
|[Pilot, Jeep]       |
+--------------------+

Desired output:

+--------------------+
|carLineName         |
+--------------------+
|Avalon;#CRV;#Camry  |
|Model T;#Model S    |
|Cayenne;#Mustang    |
|Pilot;#Jeep         |
+--------------------+

Current code which produces the input above:

val newCarDf = carDf.select(col("carLineName").cast("String").as("carLineName"))
  • Can you give us the code that you have currently? Commented Sep 11, 2019 at 15:31
  • Provided the line I used to get the input, converting the array type to String. I can easily parse the commas out of the dataframe column as a String, but the reason for changing the delimiter is to prevent free-form text containing commas from breaking the parsing, so I am looking for a way to map the array of Strings into a delimited String. Commented Sep 11, 2019 at 15:37
  • So carLineName is an array? Commented Sep 11, 2019 at 15:55
  • Yes an Array of Strings (it is being read in as a dataframe) Commented Sep 11, 2019 at 16:15

4 Answers


You can use the native function array_join (available since Spark 2.4):

import org.apache.spark.sql.functions.array_join
import spark.implicits._ // for toDF and the $ column syntax

val l = Seq(Seq("Avalon","CRV","Camry"), Seq("Model T", "Model S"), Seq("Cayenne", "Mustang"), Seq("Pilot", "Jeep"))
val df = l.toDF("carLineName")

df.withColumn("str", array_join($"carLineName", ";#")).show()

+--------------------+------------------+
|         carLineName|               str|
+--------------------+------------------+
|[Avalon, CRV, Camry]|Avalon;#CRV;#Camry|
|  [Model T, Model S]|  Model T;#Model S|
|  [Cayenne, Mustang]|  Cayenne;#Mustang|
|       [Pilot, Jeep]|       Pilot;#Jeep|
+--------------------+------------------+
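
If the arrays may contain null elements, array_join also has a three-argument overload, array_join(column, delimiter, nullReplacement), likewise available since Spark 2.4, that substitutes a placeholder instead of skipping nulls. A minimal sketch reusing df from above; the "N/A" placeholder is an arbitrary choice:

import org.apache.spark.sql.functions.array_join

// Null elements become "N/A" rather than being silently skipped.
df.withColumn("str", array_join($"carLineName", ";#", "N/A")).show()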

1 Comment

Wow this would have been nice. Unfortunately, still on Spark 1.6.

You can create a user-defined function that concatenates elements with a ";#" separator, as in the following example:

import org.apache.spark.sql.functions.{col, udf}

val df1 = Seq(
  ("1", Array("t1", "t2")),
  ("2", Array("t1", "t3", "t5"))
).toDF("id", "arr")

def formatString: Seq[String] => String = x => x.reduce(_ ++ ";#" ++ _)
def udfFormat = udf(formatString)
df1.withColumn("formatedColumn", udfFormat(col("arr")))


+---+------------+--------------+
| id|         arr|formatedColumn|
+---+------------+--------------+
|  1|    [t1, t2]|        t1;#t2|
|  2|[t1, t3, t5]|    t1;#t3;#t5|
+---+------------+--------------+

3 Comments

If I want to do this for multiple columns, what do you recommend?
You apply a foldLeft over the list of columns, with your dataframe as the initial (zero) element: columnsList.foldLeft(df){ case (data, colName) => data.withColumn(colName ++ "formated", udfFormat(col(colName))) }
If I have a scenario like [null, null, null], do you have any recommendations on how to handle that?
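
A minimal sketch addressing both comments above, using mkString instead of reduce so that null, empty, or all-null arrays yield an empty string rather than throwing; columnsList is a hypothetical list naming the array-typed columns to convert:

import org.apache.spark.sql.functions.{col, udf}

// Null-safe variant: drop null elements before joining; a null or
// all-null array produces "".
def udfSafeFormat = udf((xs: Seq[String]) =>
  Option(xs).getOrElse(Seq.empty).filter(_ != null).mkString(";#"))

val columnsList = Seq("arr") // hypothetical: the array columns to convert
val formatted = columnsList.foldLeft(df1) { case (data, colName) =>
  data.withColumn(colName ++ "formated", udfSafeFormat(col(colName)))
}
formatted.show()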

You could simply write a user-defined function (udf), which takes an Array of String as its input parameter. Inside the udf, any operation can be performed on the array.

import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
def toCustomString: UserDefinedFunction = udf((carLineName: Seq[String]) => {
  carLineName.mkString(";#")
})
val newCarDf = df.withColumn("carLineName", toCustomString(df.col("carLineName")))

This udf can be made more generic by passing the delimiter as a second parameter.

import org.apache.spark.sql.functions.lit
def toCustomStringWithDelimiter: UserDefinedFunction = udf((carLineName: Seq[String], delimiter: String) => {
  carLineName.mkString(delimiter)
})
val newCarDf = df.withColumn("carLineName", toCustomStringWithDelimiter(df.col("carLineName"), lit(";#")))



Since you are using Spark 1.6, we can do a simple map of Row to WrappedArray.

Here is how it goes.

Input:

scala> val carLineDf = Seq( (Array("Avalon","CRV","Camry")),
     |                      (Array("Model T", "Model S")),
     |                      (Array("Cayenne", "Mustang")),
     |                      (Array("Pilot", "Jeep"))
     |                    ).toDF("carLineName")
carLineDf: org.apache.spark.sql.DataFrame = [carLineName: array<string>]

Schema:

scala> carLineDf.printSchema
root
 |-- carLineName: array (nullable = true)
 |    |-- element: string (containsNull = true)

Then we just use Row.getAs to get a WrappedArray of String instead of a Row object, and we can manipulate it with the usual Scala built-ins:

scala> import scala.collection.mutable.WrappedArray
import scala.collection.mutable.WrappedArray

scala> carLineDf.map( row => row.getAs[WrappedArray[String]](0)).map( a => a.mkString(";#")).toDF("carLineNameAsString").show(false)
+-------------------+
|carLineNameAsString|
+-------------------+
|Avalon;#CRV;#Camry |
|Model T;#Model S   |
|Cayenne;#Mustang   |
|Pilot;#Jeep        |
+-------------------+

// An even easier alternative (toDF so that show works on the result)
carLineDf.map( row => row.getAs[WrappedArray[String]](0)).map( r => r.reduce(_ + ";#" + _)).toDF("carLineNameAsString").show(false)

That's it. Depending on your Spark version, you might have to map over dataframe.rdd instead; otherwise this should do.
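
For completeness, a minimal sketch of the dataframe.rdd route mentioned above; this also works on Spark 2.x, where DataFrame.map would require an Encoder (toDF assumes the shell's implicits are in scope):

import scala.collection.mutable.WrappedArray

// Map over the underlying RDD of Rows, join each array, and
// convert the resulting RDD[String] back to a DataFrame.
carLineDf.rdd
  .map(row => row.getAs[WrappedArray[String]](0).mkString(";#"))
  .toDF("carLineNameAsString")
  .show(false)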

