Spark dataframe get column value into a string variable

Question

I am trying extract column value into a variable so that I can use the value somewhere else in the code. I am trying like the following

 val name= test.filter(test("id").equalTo("200")).select("name").col("name")

It returns

 name org.apache.spark.sql.Column = name

how to get the value?

Upvoting because of how often I make the same silly mistake in a completely different language. — Joshua
– Joshua, Commented Aug 24, 2018 at 1:18

Yuan JI · Accepted Answer · 2019-03-20 08:47:37Z

46

The col("name") gives you a column expression. If you want to extract data from column "name" just do the same thing without col("name"):

val names = test.filter(test("id").equalTo("200"))
                .select("name")
                .collectAsList() // returns a List[Row]

Then for a row you could get name in String by:

val name = row.getString(0)

edited Mar 20, 2019 at 8:47

answered Jun 13, 2016 at 9:42

Yuan JI

2,9952 gold badges23 silver badges29 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

axiom Over a year ago

How do I do that in Python?

Vaibhav Over a year ago

Did you try col_list = df.select('mycol').collect() ?

Rajiv Singh · Accepted Answer · 2019-01-24 14:40:50Z

12

val maxDate = spark.sql("select max(export_time) as export_time from  tier1_spend.cost_gcp_raw").first()

val rowValue = maxDate.get(0)

answered Jan 24, 2019 at 14:40

Rajiv Singh

1,0881 gold badge11 silver badges14 bronze badges

1 Comment

Josh Mc Over a year ago

For whatever reason .get(0) was unhappy for me. spark.sql(f"select count(1) from {databricks_table}").first()[0] worked.

Sarath Subramanian · Accepted Answer · 2021-01-27 12:31:22Z

Let us assume you need to pick the name from the below table for a particular Id and store that value in a variable.

+-----+-------+
| id  | name  |
+-----+-------+
| 100 | Alex  |
| 200 | Bidan |
| 300 | Cary  |
+-----+-------+

SCALA
-----------

Irrelevant data is filtered out first and then the name column is selected and finally stored into name variable

var name = df.filter($"id" === "100").select("name").collect().map(_.getString(0)).mkString("")

PYTHON (PYSPARK)
-----------------------------

For simpler usage, I have created a function that returns the value by passing the dataframe and the desired column name to this (this is spark Dataframe and not Pandas Dataframe). Before passing the dataframe to this function, filter is applied to filter out other records.

def GetValueFromDataframe(_df,columnName):
    for row in _df.rdd.collect():       
        return row[columnName].strip()

name = GetValueFromDataframe(df.filter(df.id == "100"),"name")

There might be more simpler approach than this using 3x version of Python. The code which I showed above was tested for 2.7 version.

Note :
It is most likely to encounter out of memory error (Driver memory) since we use the collect function. Hence it is always recommended to apply transformations (like filter,where etc) before you call the collect function. If you still encounter with driver out of memory issue, you could pass --conf spark.driver.maxResultSize=0 as command line argument to make use of unlimited driver memory.

Aman Sehgal · Accepted Answer · 2020-01-07 01:33:32Z

5

By this snippet, you can extract all the values in a column into a string. Modify the snippet with where clauses to get your desired value.

val df = Seq((5, 2), (10, 1)).toDF("A", "B")

val col_val_df = df.select($"A").collect()
val col_val_str = col_val_df.map(x => x.get(0)).mkString(",")

/*
df: org.apache.spark.sql.DataFrame = [A: int, B: int]
col_val_row: Array[org.apache.spark.sql.Row] = Array([5], [10])
col_val_str: String = 5,10
*/

The value of entire column is stored in col_val_str

col_val_str: String = 5,10

answered Jan 7, 2020 at 1:33

Aman Sehgal

5664 silver badges15 bronze badges

Comments

afeldman · Accepted Answer · 2020-01-07 01:41:40Z

2

For anyone interested below is an way to turn a column into an Array, for the below case we are just taking the first value.

val names= test.filter(test("id").equalTo("200")).selectExpr("name").rdd.map(x=>x.mkString).collect
val name = names(0)

answered Jan 7, 2020 at 1:41

afeldman

5022 silver badges11 bronze badges

Comments

Priyank Pathak · Accepted Answer · 2021-02-05 18:48:19Z

0

s is the string of column values .collect() converts columns/rows to an array of lists, in this case, all rows will be converted to a tuple, temp is basically an array of such tuples/row.

x(n-1) retrieves the n-th column value for x-th row, which is by default of type "Any", so needs to be converted to String so as to append to the existing strig.

s =""
// say the n-th column is the target column 
val temp = test.collect() // converts Rows to array of list 
temp.foreach{x => 
            s += (x(n-1).asInstanceOf[String])   
        }

println(s)

edited Feb 5, 2021 at 18:48

answered Feb 4, 2021 at 18:07

Priyank Pathak

5044 silver badges18 bronze badges

2 Comments

STA Over a year ago

Thank you for contributing an answer. Would you kindly edit your answer to to include an explanation of your code? That will help future readers better understand what is going on, and especially those members of the community who are new to the language and struggling to understand the concepts.

Priyank Pathak Over a year ago

I see... let me do that

Collectives™ on Stack Overflow

Spark dataframe get column value into a string variable

6 Answers 6

2 Comments

1 Comment

Comments

Comments

Comments

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

6 Answers 6

2 Comments

1 Comment

Comments

Comments

Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related