Apply function to all elements in array<string> column

Question

I have a column with arrays of strings, e.g. like this:

["test.a" "random.ac"]
["test.41" "random.23" "test.123"]

I want to get only the text before the ".". I did it only for the fist element of the array. How do I do it to all elements? Preferably without UDFs.

df = df.withColumn("address", substring_index(col("all_addresses").getItem(0), ".", 1)))

ZygD · Accepted Answer · 2022-06-09 08:27:20Z

1

I would use a similar idea as @wwnde - transform function. transform takes an array, transforms every its element according to the provided function and results in the array of the same size, but with changed elements. Exactly what you need.

However, having the same original idea, I would probably implement it differently.
2 options:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(["test.a", "random.ac"],),
     (["test.41", "random.23", "test.123"],)],
    ['c1']
)

df = df.withColumn('c2', F.transform('c1', lambda x: F.element_at(F.split(x, '\.'), 1)))
df = df.withColumn('c3', F.transform('c1', lambda x: F.regexp_extract(x, r'(.+)\.', 1)))

df.show()
# +--------------------+--------------------+--------------------+
# |                  c1|                  c2|                  c3|
# +--------------------+--------------------+--------------------+
# | [test.a, random.ac]|      [test, random]|      [test, random]|
# |[test.41, random....|[test, random, test]|[test, random, test]|
# +--------------------+--------------------+--------------------+

answered Jun 9, 2022 at 8:27

ZygD

24.8k41 gold badges106 silver badges144 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Matt Andruff Over a year ago

F.transform is on available as a functionin spark 3. If you need to use it in spark 2.x you can use expr("transform( col, col -> col != 0)

wwnde · Accepted Answer · 2022-06-08 21:48:40Z

1

Use regexp_extract. extra alphanumerics matches if only followed by special character .

+---+------------------------------+
|id |text                          |
+---+------------------------------+
|1  |[test.a, random.ac]           |
|2  |[test.41, random.23, test.123]|
+---+------------------------------+



 df.withColumn('new', expr("transform(text,x->regexp_extract(x,'[a-z0-9]+(?=\.)',0))")).show()

+---+--------------------+--------------------+
| id|                text|                 new|
+---+--------------------+--------------------+
|  1| [test.a, random.ac]|      [test, random]|
|  2|[test.41, random....|[test, random, test]|
+---+--------------------+--------------------+

answered Jun 8, 2022 at 21:48

wwnde

26.7k6 gold badges22 silver badges38 bronze badges

Collectives™ on Stack Overflow

Apply function to all elements in array<string> column

2 Answers 2

1 Comment

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Related