pyspark: substring a string using dynamic index

Question

filtered_df = filtered_df.withColumn('POINT', substring('POINT', instr(filtered_df.POINT, "#"), 30))

I need to find the position of the first # in the string and pass that position to substring as the starting index, as in the line above. How can I do that?

The code above raises TypeError: Column is not iterable.

Do you really need the substring function or the index? It seems you could use ''.join(string.split("#")[1:])
– OneCricketeer, Commented Feb 23, 2022 at 19:19
filtered_df = filtered_df.withColumn('POINT', split(filtered_df['POINT'], "#")[1:]) gives "startPos and length must be the same type. Got <class 'int'> and <class 'NoneType'>, respectively."
– codebot, Commented Feb 23, 2022 at 19:25

ScootCork · Accepted Answer · 2022-07-10 11:18:41Z

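The substring function from pyspark.sql.functions only accepts a fixed starting position and length (plain Python integers), which is why passing the Column returned by instr raises the TypeError. However, your approach works if you write it as a SQL expression with F.expr, where instr is evaluated for each row: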

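import pyspark.sql.functions as F
from pyspark.sql import SparkSession

# Reuse the active session if one exists (e.g. in a notebook or the pyspark shell)
spark = SparkSession.builder.getOrCreate()

d = [{'POINT': 'The quick # brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog'},
    {'POINT': 'The quick brown fox jumps over the lazy dog.# The quick brown fox jumps over the lazy dog.'}]
df = spark.createDataFrame(d)

# instr() runs inside the SQL expression, so the start position is computed per row
df.withColumn('POINT', F.expr("substring(POINT, instr(POINT, '#'), 30)")).show(2, False)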

+------------------------------+
|POINT                         |
+------------------------------+
|# brown fox jumps over the laz|
|# The quick brown fox jumps ov|
+------------------------------+
