I have a set of regular expressions in a PySpark ETL job. regexp_extract can only return one group, so I am stuck using regexp_replace, where branch reset groups are not supported.
I was hoping to use named groups to keep track of what I am returning. I can declare them in the search pattern, but when I try to backreference them in the replacement using \k, $, or any other mechanism, it fails with "Illegal group reference". So I'm forced to use numbered groups and carefully keep track of each group's position and meaning.
Does anyone know how to reference named groups in the replacement pattern?
from pyspark.sql.functions import col, regexp_replace, split
from pyspark.sql.types import StructType, StructField, StringType, FloatType

rdd = sc.parallelize([('test1', "Low: 100 High: 200"),
                      ('test2', "Normal: <=25"),
                      ('test3', "Normal: >=30"),
                      ('test4', "Normal: YELLOW")])
schema = StructType([StructField('id', StringType(), True),
                     StructField('referencerange', StringType(), True)])
df = sqlContext.createDataFrame(rdd, schema)

# Collapse the three formats into "low|high"; across the alternation the
# groups are numbered 1=low1, 2=high1, 3=high2, 4=low2.
df = df.withColumn('hiloranges', regexp_replace('referencerange',
                   r"Low: (?<low1>-?[0-9.]+) High: (?<high1>-?[0-9.]+)"
                   r"|Normal: <=(?<high2>-?[0-9.]+)\s?"
                   r"|Normal: >=(?<low2>-?[0-9.]+)\s?",
                   "$1$4|$2$3"))
df = df.withColumn('range_low', split(col('hiloranges'), '\\|').getItem(0).cast(FloatType()))
df = df.withColumn('range_high', split(col('hiloranges'), '\\|').getItem(1).cast(FloatType()))
i.e. I want to use "$low1$low2|$high1$high2" as the replacement instead of "$1$4|$2$3".
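For what it's worth, the numbered-group behavior above can be checked outside Spark with Python's re module. This is only a sketch of the semantics, not Spark code: regexp_replace hands the pattern to Java's regex engine, so there the replacement syntax is $1 and the named-group declaration is (?<name>...), whereas Python uses \1 and (?P<name>...).

```python
import re

# Python-flavored version of the pattern: (?P<name>...) instead of the
# Java-style (?<name>...) accepted by Spark's regexp_replace.
pattern = (r"Low: (?P<low1>-?[0-9.]+) High: (?P<high1>-?[0-9.]+)"
           r"|Normal: <=(?P<high2>-?[0-9.]+)\s?"
           r"|Normal: >=(?P<low2>-?[0-9.]+)\s?")

# \1=low1, \2=high1, \3=high2, \4=low2; groups that did not participate
# in the match substitute as "" (Python 3.5+), so every matching branch
# still yields "low|high" with blanks on the missing side.
for s in ["Low: 100 High: 200", "Normal: <=25", "Normal: >=30", "Normal: YELLOW"]:
    print(re.sub(pattern, r"\1\4|\2\3", s))
# 100|200
# |25
# 30|
# Normal: YELLOW
```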
\g<low1>, \1 or \g<1> or \g<name_of_1>, according to the docs: docs.python.org/3/library/re.html
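Those \g<name> references do work in Python's own re, as the sketch below shows; they don't carry over to regexp_replace, which passes the replacement string to Java's regex engine (where the convention would be $n, or possibly ${name} for named groups, an assumption I have not verified in Spark itself).

```python
import re

# Same alternation as above, with Python-style named groups (?P<name>...).
pattern = (r"Low: (?P<low1>-?[0-9.]+) High: (?P<high1>-?[0-9.]+)"
           r"|Normal: <=(?P<high2>-?[0-9.]+)\s?"
           r"|Normal: >=(?P<low2>-?[0-9.]+)\s?")

# In Python's re the replacement references named groups with \g<name>;
# unmatched groups substitute as "" (Python 3.5+), so this mirrors the
# desired "$low1$low2|$high1$high2" replacement.
repl = r"\g<low1>\g<low2>|\g<high1>\g<high2>"
for s in ["Low: 100 High: 200", "Normal: <=25", "Normal: >=30"]:
    print(re.sub(pattern, repl, s))
# 100|200
# |25
# 30|
```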