I have a set of regular expressions in a PySpark ETL job. regexp_extract can only return one group, so I am stuck using regexp_replace, where branch reset groups are not supported.
I was hoping to use named groups to keep track of what I am returning. I can declare them in the search pattern, but when I try to backreference them in the replacement using \k, $, or any other mechanism, it fails with "Illegal group reference". So I'm forced to use numbered groups and carefully keep track of each group's position and meaning.
Does anyone know how to reference named groups in the replacement pattern?
from pyspark.sql.functions import col, regexp_replace, split
from pyspark.sql.types import StructType, StructField, StringType, FloatType

rdd = sc.parallelize([('test1', "Low: 100 High: 200"),
                      ('test2', "Normal: <=25"),
                      ('test3', "Normal: >=30"),
                      ('test4', "Normal: YELLOW")])
schema = StructType([StructField('id', StringType(), True),
                     StructField('referencerange', StringType(), True)])
df = sqlContext.createDataFrame(rdd, schema)

# Collapse the three formats into "low|high"; across the alternation the
# groups are numbered 1=low1, 2=high1, 3=high2, 4=low2.
df = df.withColumn('hiloranges', regexp_replace('referencerange',
                   r"Low: (?<low1>-?[0-9.]+) High: (?<high1>-?[0-9.]+)"
                   r"|Normal: <=(?<high2>-?[0-9.]+)\s?"
                   r"|Normal: >=(?<low2>-?[0-9.]+)\s?",
                   "$1$4|$2$3"))
df = df.withColumn('range_low', split(col('hiloranges'), '\\|').getItem(0).cast(FloatType()))
df = df.withColumn('range_high', split(col('hiloranges'), '\\|').getItem(1).cast(FloatType()))
i.e. I want to use "$low1$low2|$high1$high2" as the replacement instead of "$1$4|$2$3".
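For what it's worth, the numbered-group behavior above can be checked outside Spark with Python's re module. This is only a sketch of the semantics, not Spark code: regexp_replace hands the pattern to Java's regex engine, so there the replacement syntax is $1 and the named-group declaration is (?<name>...), whereas Python uses \1 and (?P<name>...).

```python
import re

# Python-flavored version of the pattern: (?P<name>...) instead of the
# Java-style (?<name>...) accepted by Spark's regexp_replace.
pattern = (r"Low: (?P<low1>-?[0-9.]+) High: (?P<high1>-?[0-9.]+)"
           r"|Normal: <=(?P<high2>-?[0-9.]+)\s?"
           r"|Normal: >=(?P<low2>-?[0-9.]+)\s?")

# \1=low1, \2=high1, \3=high2, \4=low2; groups that did not participate
# in the match substitute as "" (Python 3.5+), so every matching branch
# still yields "low|high" with blanks on the missing side.
for s in ["Low: 100 High: 200", "Normal: <=25", "Normal: >=30", "Normal: YELLOW"]:
    print(re.sub(pattern, r"\1\4|\2\3", s))
# 100|200
# |25
# 30|
# Normal: YELLOW
```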
\g<low1>, \1 or \g<1> or \g<name_of_1>, according to the docs: docs.python.org/3/library/re.html
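Those \g<name> references do work in Python's own re, as the sketch below shows; they don't carry over to regexp_replace, which passes the replacement string to Java's regex engine (where the convention would be $n, or possibly ${name} for named groups, an assumption I have not verified in Spark itself).

```python
import re

# Same alternation as above, with Python-style named groups (?P<name>...).
pattern = (r"Low: (?P<low1>-?[0-9.]+) High: (?P<high1>-?[0-9.]+)"
           r"|Normal: <=(?P<high2>-?[0-9.]+)\s?"
           r"|Normal: >=(?P<low2>-?[0-9.]+)\s?")

# In Python's re the replacement references named groups with \g<name>;
# unmatched groups substitute as "" (Python 3.5+), so this mirrors the
# desired "$low1$low2|$high1$high2" replacement.
repl = r"\g<low1>\g<low2>|\g<high1>\g<high2>"
for s in ["Low: 100 High: 200", "Normal: <=25", "Normal: >=30"]:
    print(re.sub(pattern, repl, s))
# 100|200
# |25
# 30|
```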