
I need to get a substring from a column of a DataFrame that starts at a fixed position and goes all the way to the end of the string. So, for example, for one row the substring runs from 7 to 20, for another from 7 to 21. How can I define this within the column creation?

columns = ["key", "value"]
data = [("key1", "09-2021_Books"), ("key2", "09-2021_Cds, value4"), ("key3", "09-2021_Computers"),]
df = spark.createDataFrame(data).toDF(*columns)

I'd like a column that starts after the "_" and contains only "Books", "Cds", "Computers". I tried with

df.withColumn("items", substring("value", 7, length("value") )).show()

This is the traceback:

---> 30 df.withColumn("items", substring("value", 7, length("value") )).show()
     31 

/databricks/spark/python/pyspark/sql/functions.py in substring(str, pos, len)
   2944     """
   2945     sc = SparkContext._active_spark_context
-> 2946     return Column(sc._jvm.functions.substring(_to_java_column(str), pos, len))
   2947 
   2948 

/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1294 
   1295     def __call__(self, *args):
-> 1296         args_command, temp_args = self._build_args(*args)
   1297 
   1298         command = proto.CALL_COMMAND_NAME +\

TypeError: Column is not iterable
  • Use the SQL function within an expr('substr(x, 1, n)'). The PySpark functions accept specific inputs only: substring accepts a column as the first input and integers for the rest of the inputs (see the sketch after these comments). Commented Sep 30, 2022 at 10:26
  • This Q is quite similar and may be helpful for your problem. Commented Sep 30, 2022 at 10:32
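A minimal sketch of the expr approach from the first comment, assuming the sample data above (the category starts at position 9, right after the 8-character "09-2021_" prefix; substring positions in Spark SQL are 1-based):

from pyspark.sql.functions import expr

# Inside expr(), substring and length are Spark SQL functions, so the length
# argument can itself be a column expression -- unlike the Python substring()
# helper, whose pos and len arguments must be plain integers.
df.withColumn("items", expr("substring(value, 9, length(value))")).show()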

1 Answer


The split function from pyspark.sql.functions will work for you. As the second argument of split we need to pass a regular expression, so just provide a regex matching the first 8 characters. After the split, take the second entry of the resulting array (indexing is 0-based).

import pyspark.sql.functions as sf

columns = ["key", "value"]
data = [("key1", "09-2021_Books"), ("key2", "09-2021_Cds, value4"), ("key3", "09-2021_Computers"),]
df = spark.createDataFrame(data).toDF(*columns)

df.withColumn("items", sf.split("value", "^.{8}").getItem(1)).show()

However, in your example I think a better option would be to split this column by the delimiter:

import pyspark.sql.functions as sf

columns = ["key", "value"]
data = [("key1", "09-2021_Books"), ("key2", "09-2021_Cds, value4"), ("key3", "09-2021_Computers"),]
df = spark.createDataFrame(data).toDF(*columns)

df.withColumn("items", sf.split("value", "_", 2).getItem(1)).show()

The third argument of split controls how many entries, at most, the resulting array will contain.
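For instance, with a hypothetical value containing a second underscore (assuming Spark >= 3.0, where split accepts a limit argument):

import pyspark.sql.functions as sf

# Hypothetical row with two underscores, to show the effect of the limit.
demo = spark.createDataFrame([("key4", "09-2021_Video_Games")], ["key", "value"])

# limit=2: at most two entries, so everything after the first "_" stays together.
demo.withColumn("items", sf.split("value", "_", 2).getItem(1)).show()  # Video_Games

# limit omitted (-1): split on every "_".
demo.withColumn("items", sf.split("value", "_").getItem(1)).show()     # Video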
