I am having trouble retrieving a value from a JSON string using regex in spark.
My pattern is:
val st1 = """id":"(.*?)"""
val pattern = s"${'"'}$st1${'"'}"
//pattern is: "id":"(.*?)"
My test string in a DF is
import spark.implicits._
val jsonStr = """{"type":"x","identifier":"y","id":"1d5482864c60d5bd07919490"}"""
val df = sqlContext.sparkContext.parallelize(Seq(jsonStr)).toDF("request")
I am then trying to parse out the id value and add it to the df through a UDF like so:
def getSubStringGroup(pattern: String) = udf((request: String) => {
val patternWithResponseRegex = pattern.r
var subString = request match {
case patternWithResponseRegex(idextracted) => Array(idextracted)
case _ => Array("na")
}
subString
})
val dfWithIdExtracted = df.select($"request")
.withColumn("patternMatchGroups", getSubStringGroup(pattern)($"request"))
.withColumn("idextracted", $"patternMatchGroups".getItem(0))
.drop("patternMatchGroups")
So I want my df to look like
|------------------------------------------------------------- | ------------------------|
| request | id |
|------------------------------------------------------------- | ------------------------|
|{"type":"x","identifier":"y","id":"1d5482864c60d5bd07919490"} | 1d5482864c60d5bd07919490|
| -------------------------------------------------------------|-------------------------|
However, when I try the above method, my match comes back as "null" despite working on regex101.com
Could anyone advise or suggest a different method? Thank you.
Following Krzysztof's solution, my table now looks like so:
|------------------------------------------------------------- | ------------------------|
| request | id |
|------------------------------------------------------------- | ------------------------|
|{"type":"x","identifier":"y","id":"1d5482864c60d5bd07919490"} | "id":"1d5482864c60d5bd07919490"|
| -------------------------------------------------------------|-------------------------|
I created a new udf to trim the unnecessary characters and added it to the df:
def trimId = udf((idextracted: String) => {
val id = idextracted.drop(6).dropRight(1)
id
})
val dfWithIdExtracted = df.select($"request")
.withColumn("patternMatchGroups", getSubStringGroup(pattern)($"request"))
.withColumn("idextracted", $"patternMatchGroups".getItem(0))
.withColumn("id", trimId($"idextracted"))
.drop("patternMatchGroups", "idextracted")
The df now looks as desired. Thanks again Krzysztof!