filtered_df = filtered_df.withColumn('POINT', substring('POINT', instr(filtered_df.POINT, "#"), 30))

I need to find the index of the first # in the string and pass it as the starting position of the substring, as in the line above. That line fails with TypeError: Column is not iterable. What is the right way to do this?

  • Do you really need substring function or the index? Seems you could ''.join(string.split("#")[1:]) Commented Feb 23, 2022 at 19:19
  • filtered_df = filtered_df.withColumn('POINT', split(filtered_df['POINT'], "#")[1:]) gives startPos and length must be the same type. Got <class 'int'> and <class 'NoneType'>, respectively. Commented Feb 23, 2022 at 19:25

1 Answer


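The substring function from pyspark.sql.functions expects the starting position and length as fixed Python integers, so passing the Column returned by instr raises the TypeError. However, your approach works if you write it as a SQL expression with expr, where the position is evaluated per row: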

import pyspark.sql.functions as F

d = [{'POINT': 'The quick # brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog'},
    {'POINT': 'The quick brown fox jumps over the lazy dog.# The quick brown fox jumps over the lazy dog.'}]
df = spark.createDataFrame(d)

df.withColumn('POINT', F.expr("substring(POINT, instr(POINT, '#'), 30)")).show(2, False)

+------------------------------+
|POINT                         |
+------------------------------+
|# brown fox jumps over the laz|
|# The quick brown fox jumps ov|
+------------------------------+
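
To make the 1-based indexing in that expression explicit, here is a rough plain-Python model of what it computes for each row. The helper names sql_instr and sql_substring are made up for illustration and are not part of PySpark; the model only covers non-negative positions, and it assumes (as I understand Spark's behavior) that a position of 0 is treated like 1.

```python
def sql_instr(s: str, sub: str) -> int:
    """Illustrative model of SQL instr: 1-based index of the first match, 0 if absent."""
    return s.find(sub) + 1


def sql_substring(s: str, pos: int, length: int) -> str:
    """Illustrative model of SQL substring for pos >= 0 (1-based; 0 is treated like 1)."""
    start = max(pos - 1, 0)
    return s[start:start + length]


rows = [
    "The quick # brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog",
    "The quick brown fox jumps over the lazy dog.# The quick brown fox jumps over the lazy dog.",
    "no hash sign in this row at all, so instr gives 0",
]

for row in rows:
    # Same shape as substring(POINT, instr(POINT, '#'), 30)
    print(sql_substring(row, sql_instr(row, "#"), 30))
```

Note the third row: when there is no # in the string, the position is 0 and, under the assumption above, you get the first 30 characters back rather than an empty string or null. If that is not what you want, handle those rows separately before taking the substring.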