I have an alphanumeric column named "Result" that I'd like to parse into 4 different columns: prefix, suffix, value, and pure_text.
I'd like to solve this in Spark SQL using RLIKE and regular expressions, but I'm also open to PySpark/Scala.
pure_text: contains only letters; or, if digits are present, they must come with the special character "-" (e.g. 1-9), multiple decimal points (e.g. 9.9.0), or a digit followed by a letter and then a digit again (e.g. 3x4u).
prefix: applies only to rows that can't be categorized as "pure_text"; extract any character(s) before the first digit [0-9].
suffix: applies only to rows that can't be categorized as "pure_text"; extract any character(s) after the last digit [0-9].
value: applies only to rows that can't be categorized as "pure_text"; extract the full numeric value, including the decimal point.
Result
11 H
111L
.9
<.004
>= 0.78
val<=0.6
xyz 100 abc
1-9
aaa 100.3.4
a1q1
Expected Output:
Result       Prefix   Suffix   Value    Pure_Text
11 H                  H        11
111L                  L        111
.9                             0.9
<.004        <                 0.004
>= 0.78      >=                0.78
val<=0.6     val<=             0.6
xyz 100 abc  xyz      abc      100
1-9                                     1-9
aaa 100.3.4                             aaa 100.3.4
a1q1                                    a1q1
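
In case it helps to clarify the rules, here is a minimal plain-Python sketch of the logic I'm after (names like parse_result are just for illustration). The idea is that a row is parseable only when its digits form one contiguous number with at most one decimal point; everything else falls into pure_text. The same pattern should carry over to Spark's Java regex, e.g. via regexp_extract, though I haven't verified that end to end:

```python
import re

# One pattern splits a parseable Result into prefix / value / suffix.
# The lazy \D*? keeps a leading "." out of the prefix so it stays with
# the number (e.g. "<.004" -> prefix "<", value ".004").
# Rows with a second number, a second decimal point, or digits mixed
# with letters ("1-9", "aaa 100.3.4", "a1q1") fail to match entirely.
PATTERN = re.compile(r'^(\D*?)(\d*\.?\d+)(\D*)$')

def parse_result(result):
    """Return (prefix, suffix, value, pure_text) for one Result string."""
    m = PATTERN.match(result)
    if not m:
        # No single contiguous number: the whole string is pure_text.
        return ('', '', '', result)
    prefix = m.group(1).strip()
    value = m.group(2)
    suffix = m.group(3).strip()
    if value.startswith('.'):   # normalize ".9" -> "0.9"
        value = '0' + value
    return (prefix, suffix, value, '')
```

In Spark SQL I imagine this would become three regexp_extract calls with the same pattern (groups 1, 2, 3) plus an RLIKE check to route non-matching rows into Pure_Text, but I'm not sure that's the cleanest way.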