
I want to take a JSON file and map it so that one of the columns is a substring of another. For example, to take the left table and produce the right table:

 ------------              ------------------------
|     a      |             |      a     |    b    |
|------------|       ->    |------------|---------|
|hello, world|             |hello, world|  hello  |

I can do this using Spark SQL syntax, but how can it be done using the built-in functions?
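
For reference, the SQL route looks something like this (a sketch; the temp view name t and the sample row are just for illustration):

import spark.implicits._

val df = Seq("hello, world").toDF("a")

// Register a view and run a plain SQL substring over it
df.createOrReplaceTempView("t")
spark.sql("SELECT a, substring(a, 1, 5) AS b FROM t").show()
// +------------+-----+
// |           a|    b|
// +------------+-----+
// |hello, world|hello|
// +------------+-----+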

  • Will column a always be two words delimited by a comma? And will column b always be the first word? Commented Mar 16, 2017 at 0:31
  • No and no; ideally the solution should run a substring function over column a's values to produce column b. Commented Mar 16, 2017 at 0:42

6 Answers

27

The following statement can be used:

import org.apache.spark.sql.functions._

// substring_index(str, delim, count) keeps everything before the
// count-th occurrence of delim (here, everything before the first comma)
dataFrame.select(col("a"), substring_index(col("a"), ",", 1).as("b"))


2 Comments

Do you have any syntax reference for the above code? I am not able to understand the syntax. Thanks!
The Spark functions col and substring_index are used; they are described here: spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/…
12

Suppose you have the following DataFrame:

import spark.implicits._
import org.apache.spark.sql.functions._

var df = sc.parallelize(Seq(("foobar", "foo"))).toDF("a", "b")

+------+---+
|     a|  b|
+------+---+
|foobar|foo|
+------+---+

You can derive a new column from the first column as follows; substring takes a 1-based start position and a length, and the length is capped at the end of the string, so here the result is "bar":

df = df.select(col("*"), substring(col("a"), 4, 6).as("c"))

+------+---+---+
|     a|  b|  c|
+------+---+---+
|foobar|foo|bar|
+------+---+---+
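
The positions above are hard-coded; when the bounds depend on the row (as in the question, where column a has no fixed format), the built-ins can be combined inside expr. A sketch, assuming a comma delimiter; locate returns the 1-based position of the first match (0 if absent):

// Take everything after the first ", " with per-row bounds
val q = Seq("hello, world").toDF("a")  // q is just an illustrative name
q.select(col("a"), expr("substring(a, locate(',', a) + 2)").as("b")).show()
// +------------+-----+
// |           a|    b|
// +------------+-----+
// |hello, world|world|
// +------------+-----+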


6

You could use the withColumn function together with a UDF:

import org.apache.spark.sql.functions.{ udf, col }

// Any custom substring logic goes here; as an example, take everything
// before the first comma
def substringFn(str: String): String = str.takeWhile(_ != ',')
val substringUdf = udf(substringFn _)
dataframe.withColumn("b", substringUdf(col("a")))
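
A quick check on the question's sample row (the DataFrame construction below is just for illustration):

import spark.implicits._

val df = Seq("hello, world").toDF("a")
df.withColumn("b", substringUdf(col("a"))).show()
// +------------+-----+
// |           a|    b|
// +------------+-----+
// |hello, world|hello|
// +------------+-----+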

2 Comments

UDFs are bad because, depending on what you do in them, the query planner/optimizer may not be able to "see through" them.
@JonWatte This is a good point. Keep in mind that there are some cases where the functions Spark provides are not enough, for instance converting long/lat columns into a geohash.
6

Just to enrich the existing answers: in case you are interested in the right part of the string column, that is:

 ------------              ------------------------
|     a      |             |      a     |    b    |
|------------|       ->    |------------|---------|
|hello, world|             |hello, world|  world  |

You should use a negative count as the third argument:

dataFrame.select(col("a"), substring_index(col("a"), ",", -1).as("b"))


4

You can also do it the PySpark way, as in the following example:

# Column.substr(startPos, length); positions are 1-based (0 is treated as 1)
df.withColumn('New_col', df['Old_col'].substr(0, 7)).show()


1

The substr column method takes a 1-based start position and a length. In 'hello, world' the letter 'w' is at position 8 and 'world' is 5 characters long, so:

from pyspark.sql.functions import col

df = df.withColumn('b', col('a').substr(8, 5))

If you want the last 5 characters instead, use a negative start position:

df = df.withColumn('b', col('a').substr(-5, 5))

