0

I have a column with arrays of strings, e.g. like this:

["test.a" "random.ac"]
["test.41" "random.23" "test.123"]

I want to get only the text before the ".". I did it only for the fist element of the array. How do I do it to all elements? Preferably without UDFs.

df = df.withColumn("address", substring_index(col("all_addresses").getItem(0), ".", 1)))

2 Answers 2

1

I would use a similar idea as @wwnde - transform function. transform takes an array, transforms every its element according to the provided function and results in the array of the same size, but with changed elements. Exactly what you need.

However, having the same original idea, I would probably implement it differently.
2 options:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(["test.a", "random.ac"],),
     (["test.41", "random.23", "test.123"],)],
    ['c1']
)

df = df.withColumn('c2', F.transform('c1', lambda x: F.element_at(F.split(x, '\.'), 1)))
df = df.withColumn('c3', F.transform('c1', lambda x: F.regexp_extract(x, r'(.+)\.', 1)))

df.show()
# +--------------------+--------------------+--------------------+
# |                  c1|                  c2|                  c3|
# +--------------------+--------------------+--------------------+
# | [test.a, random.ac]|      [test, random]|      [test, random]|
# |[test.41, random....|[test, random, test]|[test, random, test]|
# +--------------------+--------------------+--------------------+
Sign up to request clarification or add additional context in comments.

1 Comment

F.transform is on available as a functionin spark 3. If you need to use it in spark 2.x you can use expr("transform( col, col -> col != 0)
1

Use regexp_extract. extra alphanumerics matches if only followed by special character .

+---+------------------------------+
|id |text                          |
+---+------------------------------+
|1  |[test.a, random.ac]           |
|2  |[test.41, random.23, test.123]|
+---+------------------------------+



 df.withColumn('new', expr("transform(text,x->regexp_extract(x,'[a-z0-9]+(?=\.)',0))")).show()

+---+--------------------+--------------------+
| id|                text|                 new|
+---+--------------------+--------------------+
|  1| [test.a, random.ac]|      [test, random]|
|  2|[test.41, random....|[test, random, test]|
+---+--------------------+--------------------+

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.