
I want to take a JSON file and map it so that one of the columns is a substring of another. For example, to take the left table and produce the right table:

 ------------              ------------------------
|     a      |             |      a     |    b    |
|------------|       ->    |------------|---------|
|hello, world|             |hello, world|  hello  |

I can do this using Spark SQL syntax, but how can it be done using the built-in functions?
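
For reference, the SQL route looks something like this (a sketch; the temp view name t and the sample row are just for illustration):

import spark.implicits._

val df = Seq("hello, world").toDF("a")

// Register a view and run a plain SQL substring over it
df.createOrReplaceTempView("t")
spark.sql("SELECT a, substring(a, 1, 5) AS b FROM t").show()
// +------------+-----+
// |           a|    b|
// +------------+-----+
// |hello, world|hello|
// +------------+-----+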

  • Will column a always be two words delimited by a comma? And will column b always be the first word? Commented Mar 16, 2017 at 0:31
  • No and no; ideally the solution should run a substring function over column a's values to produce column b. Commented Mar 16, 2017 at 0:42

6 Answers

27

The following statement can be used:

import org.apache.spark.sql.functions._

// substring_index(str, delim, count) keeps everything before the
// count-th occurrence of delim (here, everything before the first comma)
dataFrame.select(col("a"), substring_index(col("a"), ",", 1).as("b"))


2 Comments

Do you have any syntax reference for the above code? I am not able to understand the syntax. Thanks!
The Spark functions col and substring_index are used; they are described here: spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/…
12

Suppose you have the following DataFrame:

import spark.implicits._
import org.apache.spark.sql.functions._

var df = sc.parallelize(Seq(("foobar", "foo"))).toDF("a", "b")

+------+---+
|     a|  b|
+------+---+
|foobar|foo|
+------+---+

You can derive a new column from the first column as follows; substring takes a 1-based start position and a length, and the length is capped at the end of the string, so here the result is "bar":

df = df.select(col("*"), substring(col("a"), 4, 6).as("c"))

+------+---+---+
|     a|  b|  c|
+------+---+---+
|foobar|foo|bar|
+------+---+---+
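
The positions above are hard-coded; when the bounds depend on the row (as in the question, where column a has no fixed format), the built-ins can be combined inside expr. A sketch, assuming a comma delimiter; locate returns the 1-based position of the first match (0 if absent):

// Take everything after the first ", " with per-row bounds
val q = Seq("hello, world").toDF("a")  // q is just an illustrative name
q.select(col("a"), expr("substring(a, locate(',', a) + 2)").as("b")).show()
// +------------+-----+
// |           a|    b|
// +------------+-----+
// |hello, world|world|
// +------------+-----+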


6

You could use the withColumn function together with a UDF:

import org.apache.spark.sql.functions.{ udf, col }

// Any custom substring logic goes here; as an example, take everything
// before the first comma
def substringFn(str: String): String = str.takeWhile(_ != ',')
val substringUdf = udf(substringFn _)
dataframe.withColumn("b", substringUdf(col("a")))
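
A quick check on the question's sample row (the DataFrame construction below is just for illustration):

import spark.implicits._

val df = Seq("hello, world").toDF("a")
df.withColumn("b", substringUdf(col("a"))).show()
// +------------+-----+
// |           a|    b|
// +------------+-----+
// |hello, world|hello|
// +------------+-----+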

2 Comments

UDFs are bad because, depending on what you do in them, the query planner/optimizer may not be able to "see through" them.
@JonWatte This is a good point. Keep in mind that there are some cases where the functions Spark provides are not enough, for instance converting long/lat columns into a geohash.
6

Just to enrich the existing answers: in case you are interested in the right part of the string column, that is:

 ------------              ------------------------
|     a      |             |      a     |    b    |
|------------|       ->    |------------|---------|
|hello, world|             |hello, world|  world  |

You should use a negative count as the third argument:

dataFrame.select(col("a"), substring_index(col("a"), ",", -1).as("b"))


4

You can also do it the PySpark way, as in the following example:

# Column.substr(startPos, length); positions are 1-based (0 is treated as 1)
df.withColumn('New_col', df['Old_col'].substr(0, 7)).show()


1

The substr column method takes a 1-based start position and a length. In 'hello, world' the letter 'w' is at position 8 and 'world' is 5 characters long, so:

from pyspark.sql.functions import col

df = df.withColumn('b', col('a').substr(8, 5))

If you want the last 5 characters instead, use a negative start position:

df = df.withColumn('b', col('a').substr(-5, 5))

