I have a set of regular expressions in a PySpark ETL job. regexp_extract can only return one group per call.

Therefore I am stuck using regexp_replace, but branch reset groups are not supported.

I was hoping to use named groups to keep track of what I am returning. I can define them in the search pattern, but when I try to backreference them in the replacement using \k, $, or any other mechanism, it fails with "Illegal group reference". So I am forced to use numbered groups and carefully keep track of their position and meaning.

Does anyone know how to reference named groups in the replacement pattern?

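from pyspark.sql.functions import col, regexp_replace, split
from pyspark.sql.types import FloatType, StringType, StructField, StructType

# sc and sqlContext are assumed to be the SparkContext and SQLContext of the session.
rdd = sc.parallelize([('test1', "Low: 100 High: 200"),
                      ('test2', "Normal: <=25"),
                      ('test3', "Normal: >=30"),
                      ('test4', "Normal: YELLOW")
                    ])
schema = StructType([StructField('id', StringType(), True),
                     StructField('referencerange', StringType(), True)])
df = sqlContext.createDataFrame(rdd, schema)
# Groups by position: $1 = low1, $2 = high1, $3 = high2, $4 = low2.
df = df.withColumn('hiloranges', regexp_replace('referencerange',
                                                r"Low: (?<low1>-?[0-9.]+) High: (?<high1>-?[0-9.]+)"
                                                r"|Normal: <=(?<high2>-?[0-9.]+)\s?"
                                                r"|Normal: >=(?<low2>-?[0-9.]+)\s?"
                                                , "$1$4|$2$3"))
# Split the "low|high" string and cast each side to a float.
df = df.withColumn('range_low', split(col('hiloranges'),'\\|').getItem(0).cast(FloatType()))
df = df.withColumn('range_high', split(col('hiloranges'),'\\|').getItem(1).cast(FloatType()))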

That is, I want to use "$low1$low2|$high1$high2" as the replacement instead of "$1$4|$2$3".

  • Try with \g<low1>. Commented Nov 15, 2019 at 21:50
  • Yeah, it looks like Python supports three ways to reference capture groups in the replacement: \1, \g<1>, or \g<name_of_1>, according to the docs at docs.python.org/3/library/re.html. Commented Nov 15, 2019 at 22:37
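
For context, the \g<name> form mentioned in the comments belongs to Python's own re module. The short sketch below shows it working there, using a simplified version of the question's first alternative (the group names low and high are chosen for illustration). Spark's regexp_replace hands the pattern to Java's regex engine rather than to re, so this syntax should not be expected to carry over.

```python
import re

# Python's re module defines named groups with (?P<name>...) and
# references them in the replacement string with \g<name>.
pattern = re.compile(r"Low: (?P<low>-?[0-9.]+) High: (?P<high>-?[0-9.]+)")

result = pattern.sub(r"\g<low>|\g<high>", "Low: 100 High: 200")
print(result)  # 100|200
```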

1 Answer

Spark SQL's regexp_replace uses Java regular expressions, where the replacement string refers to a named group as ${name}. So your desired replacement string would be

${low1}${low2}|${high1}${high2}
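
The ${name} form could be dropped into the question's pipeline roughly as follows. This is a minimal sketch under a few assumptions: add_range_columns and group_name_mismatch are hypothetical helper names (not Spark APIs), the DataFrame is assumed to have the question's referencerange column, and the pyspark imports sit inside the function only so that the name check can run without a Spark installation. The check is there because a misspelled or malformed group reference would otherwise surface only when the Spark job runs.

```python
import re

# Java-regex pattern from the question, with named groups.
PATTERN = (
    r"Low: (?<low1>-?[0-9.]+) High: (?<high1>-?[0-9.]+)"
    r"|Normal: <=(?<high2>-?[0-9.]+)\s?"
    r"|Normal: >=(?<low2>-?[0-9.]+)\s?"
)
# Java-style replacement string: ${name} refers to a named group.
REPLACEMENT = "${low1}${low2}|${high1}${high2}"


def group_name_mismatch(pattern, replacement):
    """Compare (?<name>...) groups in the pattern with ${name} references.

    Returns (undefined_refs, unused_groups), each a sorted list of names.
    """
    defined = set(re.findall(r"\(\?<([A-Za-z][A-Za-z0-9]*)>", pattern))
    used = set(re.findall(r"\$\{([A-Za-z][A-Za-z0-9]*)\}", replacement))
    return sorted(used - defined), sorted(defined - used)


def add_range_columns(df):
    """Hypothetical helper: add range_low and range_high to the DataFrame."""
    # Imported here so the check above can run without Spark installed.
    from pyspark.sql.functions import col, regexp_replace, split
    from pyspark.sql.types import FloatType

    replaced = regexp_replace(col("referencerange"), PATTERN, REPLACEMENT)
    parts = split(replaced, r"\|")
    return (df.withColumn("range_low", parts.getItem(0).cast(FloatType()))
              .withColumn("range_high", parts.getItem(1).cast(FloatType())))
```

With the question's DataFrame, usage would be df = add_range_columns(df), and group_name_mismatch(PATTERN, REPLACEMENT) can be run beforehand to confirm that every group is defined and referenced.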