
I have a dataframe with two columns:

id (string), date (timestamp)

I would like to loop through the dataframe and add a new column containing a URL that includes the id. The algorithm should look something like this:

 add one new column with the following value:
 for each id
       "some url" + the value of the dataframe's id column

I tried to make this work in Scala, but I have problems getting the id at index a:

 val k = df2.count().asInstanceOf[Int]
 // for loop execution with a range
 for (a <- 1 to k) {
   // println("Value of a: " + a)
   val dfWithFileURL = dataframe.withColumn("fileUrl", "https://someURL/" + dataframe("id")[a])
 }

But this

dataframe("id")[a]

is not valid Scala. I have not found a solution yet, so any suggestions are welcome!

  • Do you even need a loop? df2.withColumn("fileUrl", "https://someURL/" + $"id") might work? Commented Feb 26, 2019 at 10:48
  • You want to add as many columns as there are rows? That won't scale at all, and if you don't need it to be scalable, there is no reason to use a Spark DataFrame... Commented Feb 26, 2019 at 10:55
  • The solution of @zacdav throws the following error: error: type mismatch; found: String, required: org.apache.spark.sql.Column. I'm working in a Databricks notebook; maybe it works differently there? I'll try to investigate this too... Commented Feb 26, 2019 at 16:30
  • By the way, @eliasah, no, I want to add only one column with a URL that has the id appended to the link; the id comes from the first column. Commented Feb 26, 2019 at 16:30
  • 1
    Then your pseudo-code is wrong because this is what it does. Commented Feb 26, 2019 at 16:35
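To expand on the type mismatch reported in the comments: "https://someURL/" + $"id" compiles to plain String concatenation (String.+ accepts anything and calls its toString), so withColumn receives a String where it needs a Column. A minimal sketch of the fix, assuming the df2 and id column from the question:

```scala
import org.apache.spark.sql.functions.{concat, lit}

// lit() turns the string literal into a Column, and concat() combines
// the two Column expressions, so withColumn gets a Column as required.
val dfWithFileURL = df2.withColumn("fileUrl", concat(lit("https://someURL/"), df2("id")))
```

This adds the single fileUrl column for every row at once; no loop over row indices is needed.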

2 Answers


You can simply use the withColumn function in Scala, something like this:

import org.apache.spark.sql.functions.{concat, lit}
import spark.implicits._  // assumes a SparkSession named spark is in scope

val df = Seq(
  ( 1, "1 Jan 2000" ),
  ( 2, "2 Feb 2014" ),
  ( 3, "3 Apr 2017" )
)
  .toDF("id", "date")

// Add the fileUrl column
val dfNew = df
  .withColumn("fileUrl", concat(lit("https://someURL/"), $"id"))

dfNew.show

My results:

+--+----------+-----------------+
|id|      date|          fileUrl|
+--+----------+-----------------+
| 1|1 Jan 2000|https://someURL/1|
| 2|2 Feb 2014|https://someURL/2|
| 3|3 Apr 2017|https://someURL/3|
+--+----------+-----------------+




Not sure if this is what you require, but you can use zipWithIndex for indexing.

data.show()

+---+------------------+
| Id|               Url|
+---+------------------+
|111|http://abc.go.org/|
|222|http://xyz.go.net/|
+---+------------------+

import org.apache.spark.sql._
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val df = sqlContext.createDataFrame(
  data.rdd.zipWithIndex
    .map { case (r, i) => Row.fromSeq(r.toSeq :+ s"${r.getString(1)}${i + 1}") },
  StructType(data.schema.fields :+ StructField("fileUrl", StringType, false))
)

Output:

df.show(false)

+---+------------------+-------------------+
|Id |Url               |fileUrl            |
+---+------------------+-------------------+
|111|http://abc.go.org/|http://abc.go.org/1|
|222|http://xyz.go.net/|http://xyz.go.net/2|
+---+------------------+-------------------+
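If you only need a unique per-row id rather than a strict consecutive 1..N index, Spark's built-in monotonically_increasing_id function avoids the RDD round-trip entirely. Note this is a different trade-off, not a drop-in replacement: its values are unique and increasing but may have gaps. A sketch, assuming the same data DataFrame:

```scala
import org.apache.spark.sql.functions.{concat, monotonically_increasing_id}

// Appends a per-row id to the Url column; ids are unique but may
// have gaps, unlike zipWithIndex's consecutive 0, 1, 2, ... sequence.
val withId = data.withColumn("fileUrl",
  concat(data("Url"), monotonically_increasing_id()))
```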

