
I need to get a substring from a column of a DataFrame that starts at a fixed position and goes all the way to the end of the string. So, for example, for one row the substring runs from 7 to 20, for another from 7 to 21. How can I define this within the column creation?

columns = ["key", "value"]
data = [("key1", "09-2021_Books"), ("key2", "09-2021_Cds, value4"), ("key3", "09-2021_Computers"),]
df = spark.createDataFrame(data).toDF(*columns)

I'd like a column that starts after the "_" and contains only "Books", "Cds", "Computers". I tried with

df.withColumn("items", substring("value", 7, length("value") )).show()

This is the traceback:

---> 30 df.withColumn("items", substring("value", 7, length("value") )).show()
     31 

/databricks/spark/python/pyspark/sql/functions.py in substring(str, pos, len)
   2944     """
   2945     sc = SparkContext._active_spark_context
-> 2946     return Column(sc._jvm.functions.substring(_to_java_column(str), pos, len))
   2947 
   2948 

/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1294 
   1295     def __call__(self, *args):
-> 1296         args_command, temp_args = self._build_args(*args)
   1297 
   1298         command = proto.CALL_COMMAND_NAME +\

TypeError: Column is not iterable
  • Use the SQL function within an expr('substr(x, 1, n)'). The PySpark functions accept specific inputs only: substring accepts a column as the first input and integers for the rest of the inputs (see the sketch after these comments). Commented Sep 30, 2022 at 10:26
  • This Q is quite similar and may be helpful for your problem. Commented Sep 30, 2022 at 10:32
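A minimal sketch of the expr approach from the first comment, assuming the sample data above (the category starts at position 9, right after the 8-character "09-2021_" prefix; substring positions in Spark SQL are 1-based):

from pyspark.sql.functions import expr

# Inside expr(), substring and length are Spark SQL functions, so the length
# argument can itself be a column expression -- unlike the Python substring()
# helper, whose pos and len arguments must be plain integers.
df.withColumn("items", expr("substring(value, 9, length(value))")).show()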

1 Answer


The split function from pyspark.sql.functions will work for you. As the second argument of split we need to pass a regular expression, so just provide a regex matching the first 8 characters. After the split, take the second entry of the resulting array (indexing is 0-based).

import pyspark.sql.functions as sf

columns = ["key", "value"]
data = [("key1", "09-2021_Books"), ("key2", "09-2021_Cds, value4"), ("key3", "09-2021_Computers"),]
df = spark.createDataFrame(data).toDF(*columns)

df.withColumn("items", sf.split("value", "^.{8}").getItem(1)).show()

However, in your example I think a better option would be to split this column by the delimiter:

import pyspark.sql.functions as sf

columns = ["key", "value"]
data = [("key1", "09-2021_Books"), ("key2", "09-2021_Cds, value4"), ("key3", "09-2021_Computers"),]
df = spark.createDataFrame(data).toDF(*columns)

df.withColumn("items", sf.split("value", "_", 2).getItem(1)).show()

The third argument of split controls how many entries, at most, the resulting array will contain.
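For instance, with a hypothetical value containing a second underscore (assuming Spark >= 3.0, where split accepts a limit argument):

import pyspark.sql.functions as sf

# Hypothetical row with two underscores, to show the effect of the limit.
demo = spark.createDataFrame([("key4", "09-2021_Video_Games")], ["key", "value"])

# limit=2: at most two entries, so everything after the first "_" stays together.
demo.withColumn("items", sf.split("value", "_", 2).getItem(1)).show()  # Video_Games

# limit omitted (-1): split on every "_".
demo.withColumn("items", sf.split("value", "_").getItem(1)).show()     # Video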
