
I am having trouble retrieving a value from a JSON string using regex in Spark.

My pattern is:

val st1 = """id":"(.*?)"""
val pattern = s"${'"'}$st1${'"'}"
//pattern is: "id":"(.*?)"

My test string in a DF is

import spark.implicits._
val jsonStr = """{"type":"x","identifier":"y","id":"1d5482864c60d5bd07919490"}"""                         
val df = sqlContext.sparkContext.parallelize(Seq(jsonStr)).toDF("request")   

I am then trying to parse out the id value and add it to the df through a UDF like so:

def getSubStringGroup(pattern: String) = udf((request: String) => {
  val patternWithResponseRegex = pattern.r
  var subString = request match {
    case patternWithResponseRegex(idextracted) => Array(idextracted)
    case _ => Array("na")
  }
  subString
})

val dfWithIdExtracted = df.select($"request")
  .withColumn("patternMatchGroups", getSubStringGroup(pattern)($"request"))
  .withColumn("idextracted", $"patternMatchGroups".getItem(0))
  .drop("patternMatchGroups")

So I want my df to look like

| request                                                       | id                       |
|---------------------------------------------------------------|--------------------------|
| {"type":"x","identifier":"y","id":"1d5482864c60d5bd07919490"} | 1d5482864c60d5bd07919490 |

However, when I try the above method, my match comes back as "null", despite the pattern working on regex101.com.

Could anyone advise or suggest a different method? Thank you.

Following Krzysztof's solution, my table now looks like so:

| request                                                       | id                              |
|---------------------------------------------------------------|---------------------------------|
| {"type":"x","identifier":"y","id":"1d5482864c60d5bd07919490"} | "id":"1d5482864c60d5bd07919490" |

I created a new udf to trim the unnecessary characters and added it to the df:

def trimId = udf((idextracted: String) => {
  // drop the leading `"id":"` (6 characters) and the trailing quote
  val id = idextracted.drop(6).dropRight(1)
  id
})


val dfWithIdExtracted = df.select($"request")
  .withColumn("patternMatchGroups", getSubStringGroup(pattern)($"request"))
  .withColumn("idextracted", $"patternMatchGroups".getItem(0))
  .withColumn("id", trimId($"idextracted"))
  .drop("patternMatchGroups", "idextracted")

The df now looks as desired. Thanks again Krzysztof!
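
For anyone who finds this later: Spark also has a built-in regexp_extract function that pulls a capture group straight out of a column, which would avoid the UDF and the trimming step entirely. A minimal sketch, assuming the same df and pattern defined above:

import org.apache.spark.sql.functions.regexp_extract

// extract capture group 1 of the pattern directly from the request column
// (note: regexp_extract returns an empty string rather than "na" when there is no match)
val dfWithId = df.withColumn("id", regexp_extract($"request", pattern, 1))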

1 Answer

When you use pattern matching with a regex, you're trying to match the whole string, which obviously can't succeed here. You should use findFirstIn (or findFirstMatchIn, if you need the capture groups) instead:

def getSubStringGroup(pattern: String) = udf((request: String) => {
  val patternWithResponseRegex = pattern.r
  // findFirstIn returns the first whole match as an Option[String]
  patternWithResponseRegex.findFirstIn(request).map(Array(_)).getOrElse(Array("na"))
})
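
If you only want the captured id rather than the whole match, findFirstMatchIn exposes the capture groups. A minimal sketch of that variant, with the same pattern and group 1 holding the id:

def getSubStringGroup(pattern: String) = udf((request: String) => {
  val patternWithResponseRegex = pattern.r
  // findFirstMatchIn returns Option[Regex.Match]; group(1) is the value between the quotes
  patternWithResponseRegex.findFirstMatchIn(request)
    .map(m => Array(m.group(1)))
    .getOrElse(Array("na"))
})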

You're also creating your pattern in a very bizarre way unless you've got a special use case for it. You could just do:

val pattern = """"id":"(.*?)""""
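
For reference, the quad-quoted literal evaluates to exactly the same string as the interpolation trick in the question; a quick sanity check (sketch):

val st1 = """id":"(.*?)"""
val patternViaInterpolation = s"${'"'}$st1${'"'}"
val patternViaQuadQuotes = """"id":"(.*?)""""

// both evaluate to: "id":"(.*?)"
assert(patternViaInterpolation == patternViaQuadQuotes)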

2 Comments

I build the pattern that way because I am trying to get "id":"(.*?)". Your way leaves out the final double quote: "id":"(.*?). Your solution does not work with your pattern, but it does work with mine! It includes the entire string "id":"1d5482864c60d5bd07919490", but that can be handled with some string trimming. I will post the final solution in a few minutes for anyone who comes across this problem in the future. Thank you Krzysztof!
You're right, I lost one quote when I was copying the last snippet. With 4 quotes it works. I edited my answer.
