0

I am trying to extract multiple sequences of numerical characters out of a long string. With regexp_extract, I am able to extract the first sequence, but how can I extract the following ones and write them into new columns (one column for every sequence)?

df.withColumn('IDENT1', F.df(bookings.IDENTIFIER, "(\d+)", 0)) 
+-----------------------------------------------------------------------------+----------+
|IDENTIFIER                                                                   |IDENT1    |                                                                                                                                  
+-----------------------------------------------------------------------------+----------+
|31010833 - 9911075145 - some more random text                                |31010833  |
|TL/9910900047/asdiqwoidw/more text// 23231263264                             |9910900047|
+-----------------------------------------------------------------------------+----------+

1 Answer 1

2

I would use a UDF to extract the numbers, put them in an array and then explode that array to columns, or just do the select.

Example:

import re
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType


def extract(text):
    all_matches = re.findall(r"-?\d(?:[,\d]*\.\d+|[,\d]*)", text)
    return all_matches


extract_udf = F.udf(extract, ArrayType(StringType()))

spark = SparkSession.builder.getOrCreate()
data = [
    {"test": "31010833 - 9911075145 - some more random text"},
    {"test": "TL/9910900047/asdiqwoidw/more text// 23231263264 "},
]
df = spark.createDataFrame(data=data)
df = df.withColumn("numbers", extract_udf("test"))
df = df.select(["test"] + [df.numbers[i] for i in range(2)])
df.show(10, False)
df.printSchema()

Result:

+-------------------------------------------------+----------+-----------+      
|test                                             |numbers[0]|numbers[1] |
+-------------------------------------------------+----------+-----------+
|31010833 - 9911075145 - some more random text    |31010833  |9911075145 |
|TL/9910900047/asdiqwoidw/more text// 23231263264 |9910900047|23231263264|
+-------------------------------------------------+----------+-----------+

root
 |-- test: string (nullable = true)
 |-- numbers[0]: string (nullable = true)
 |-- numbers[1]: string (nullable = true)

Also, if you use Spark 3.1+ you can use regexp_extract_all. See question: Extracting several regex matches in PySpark

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.