
I am having trouble retrieving a value from a JSON string using regex in Spark.

My pattern is:

val st1 = """id":"(.*?)"""
val pattern = s"${'"'}$st1${'"'}"
//pattern is: "id":"(.*?)"

My test string in a DF is

import spark.implicits._
val jsonStr = """{"type":"x","identifier":"y","id":"1d5482864c60d5bd07919490"}"""                         
val df = sqlContext.sparkContext.parallelize(Seq(jsonStr)).toDF("request")   

I am then trying to parse out the id value and add it to the df through a UDF like so:

def getSubStringGroup(pattern: String) = udf((request: String) => {
  val patternWithResponseRegex = pattern.r
  var subString = request match {
    case patternWithResponseRegex(idextracted) => Array(idextracted)
    case _ => Array("na")
  }
  subString
})

val dfWithIdExtracted = df.select($"request")
  .withColumn("patternMatchGroups", getSubStringGroup(pattern)($"request"))
  .withColumn("idextracted", $"patternMatchGroups".getItem(0))
  .drop("patternMatchGroups")

So I want my df to look like

| request                                                       | id                       |
|---------------------------------------------------------------|--------------------------|
| {"type":"x","identifier":"y","id":"1d5482864c60d5bd07919490"} | 1d5482864c60d5bd07919490 |

However, when I try the above method, my match comes back as "null", despite the pattern working on regex101.com.

Could anyone advise or suggest a different method? Thank you.

Following Krzysztof's solution, my table now looks like so:

| request                                                       | id                              |
|---------------------------------------------------------------|---------------------------------|
| {"type":"x","identifier":"y","id":"1d5482864c60d5bd07919490"} | "id":"1d5482864c60d5bd07919490" |

I created a new udf to trim the unnecessary characters and added it to the df:

def trimId = udf((idextracted: String) => {
  // drop the leading `"id":"` (6 characters) and the trailing quote
  val id = idextracted.drop(6).dropRight(1)
  id
})


val dfWithIdExtracted = df.select($"request")
  .withColumn("patternMatchGroups", getSubStringGroup(pattern)($"request"))
  .withColumn("idextracted", $"patternMatchGroups".getItem(0))
  .withColumn("id", trimId($"idextracted"))
  .drop("patternMatchGroups", "idextracted")

The df now looks as desired. Thanks again Krzysztof!
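
For anyone who finds this later: Spark also has a built-in regexp_extract function that pulls a capture group straight out of a column, which would avoid the UDF and the trimming step entirely. A minimal sketch, assuming the same df and pattern defined above:

import org.apache.spark.sql.functions.regexp_extract

// extract capture group 1 of the pattern directly from the request column
// (note: regexp_extract returns an empty string rather than "na" when there is no match)
val dfWithId = df.withColumn("id", regexp_extract($"request", pattern, 1))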

1 Answer

When you use pattern matching with a regex, you're trying to match the whole string, which obviously can't succeed here. You should use findFirstIn (or findFirstMatchIn, if you need the capture groups) instead:

def getSubStringGroup(pattern: String) = udf((request: String) => {
  val patternWithResponseRegex = pattern.r
  // findFirstIn returns the first whole match as an Option[String]
  patternWithResponseRegex.findFirstIn(request).map(Array(_)).getOrElse(Array("na"))
})
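
If you only want the captured id rather than the whole match, findFirstMatchIn exposes the capture groups. A minimal sketch of that variant, with the same pattern and group 1 holding the id:

def getSubStringGroup(pattern: String) = udf((request: String) => {
  val patternWithResponseRegex = pattern.r
  // findFirstMatchIn returns Option[Regex.Match]; group(1) is the value between the quotes
  patternWithResponseRegex.findFirstMatchIn(request)
    .map(m => Array(m.group(1)))
    .getOrElse(Array("na"))
})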

You're also creating your pattern in a very bizarre way unless you've got a special use case for it. You could just do:

val pattern = """"id":"(.*?)""""
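
For reference, the quad-quoted literal evaluates to exactly the same string as the interpolation trick in the question; a quick sanity check (sketch):

val st1 = """id":"(.*?)"""
val patternViaInterpolation = s"${'"'}$st1${'"'}"
val patternViaQuadQuotes = """"id":"(.*?)""""

// both evaluate to: "id":"(.*?)"
assert(patternViaInterpolation == patternViaQuadQuotes)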

2 Comments

I build the pattern that way because I am trying to get "id":"(.*?)". Your way leaves out the final double quote: "id":"(.*?). Your solution does not work with your pattern, but it does work with mine! It includes the entire string "id":"1d5482864c60d5bd07919490", but that can be handled with some string trimming. I will post the final solution in a few minutes for anyone who comes across this problem in the future. Thank you Krzysztof!
You're right, I lost one quote when I was copying the last snippet. With 4 quotes it works. I edited my answer.
