34

I've read several posts on using the "like" operator to filter a Spark DataFrame for rows containing a string or expression, but I was wondering whether building the condition with %s, as below, is a "best practice":

input_path = <s3_location_str>
my_keyword = "Arizona.*hot"  # a regex expression
dx = sqlContext.read.parquet(input_path)  # "keyword" is a field in dx

# is the following correct?
substr = "'%%%s%%'" % my_keyword  # escape % via %% to get "%"
dk = dx.filter("keyword like %s" % substr)

# dk should contain rows with keyword values such as "Arizona is hot."

Note

I'm trying to get all rows in dx that contain the expression my_keyword. For exact matches, the surrounding percent signs '%' wouldn't be needed.

3 Answers

48

From neeraj's hint, it seems like the correct way to do this in pyspark is:

expr = "Arizona.*hot"
dk = dx.filter(dx["keyword"].rlike(expr))

Note that dx.filter($"keyword" ...) did not work, since my version of PySpark didn't seem to support the $ notation out of the box.
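A quick way to see why no surrounding '%' is needed with rlike: the pattern is searched for anywhere in the value rather than matched against the whole string. Below is a minimal sketch that checks the pattern locally with Python's re module before running it on a cluster. The sample strings and the helper name filter_by_regex are made up for illustration, and it assumes Python's re and Spark's Java regex engine agree on a pattern this simple (they differ on more exotic syntax).

```python
import re

# Hypothetical sample values standing in for the "keyword" column.
samples = [
    "Arizona is hot.",
    "Arizona is dry.",
    "It is hot in Arizona",
    "arizona is hot",
]

expr = "Arizona.*hot"

# re.search looks for the pattern anywhere in the string, which mirrors the
# unanchored matching that rlike performs, so no wildcards are wrapped around it.
matches = [s for s in samples if re.search(expr, s)]
print(matches)


def filter_by_regex(df, column, pattern):
    """Keep the rows of a Spark DataFrame whose `column` matches `pattern`."""
    # Imported lazily so the local check above runs without Spark installed.
    from pyspark.sql import functions as F

    return df.filter(F.col(column).rlike(pattern))
```

With a real DataFrame this would be called as dk = filter_by_regex(dx, "keyword", expr). The match is case-sensitive and order-sensitive, so only values where "Arizona" appears before "hot" pass.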


1 Comment

I'd recommend using implicit column selection rather than referencing dx twice, e.g., dk = dx.filter(F.col("keyword").rlike(expr)). The Palantir PySpark Style Guide recommends this because it makes the code more portable (you don't have to update dx in both places). For this to work, you'll need from pyspark.sql import functions as F.
13

Try the rlike function, as shown below.

df.filter(<column_name> rlike "<regex_pattern>")

For example:

dk = dx.filter($"keyword" rlike "<pattern>")

1 Comment

Is this Scala? PySpark doesn't seem to support the col rlike expr syntax.
8

I used the following for a timestamp regex:

expression = r'[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[1-2][0-9]|3[0-1]) (2[0-3]|[01][0-9]):[0-5][0-9]:[0-5][0-9]'
df1 = df.filter(df['eta'].rlike(expression))
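One thing to keep in mind with a pattern like this is that rlike matches anywhere in the value, so a string with extra text around the timestamp still passes. A rough local check with Python's re module is sketched below. The helper names contains_timestamp and is_exact_timestamp are hypothetical, and it assumes Python's re and Spark's Java regex treat these simple constructs the same way.

```python
import re

# The same pattern as in the answer above.
TIMESTAMP = (
    r'[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[1-2][0-9]|3[0-1]) '
    r'(2[0-3]|[01][0-9]):[0-5][0-9]:[0-5][0-9]'
)


def contains_timestamp(value):
    """True if a timestamp appears anywhere in `value` (unanchored search)."""
    return re.search(TIMESTAMP, value) is not None


def is_exact_timestamp(value):
    """True only if the whole of `value` is a timestamp."""
    return re.fullmatch(TIMESTAMP, value) is not None


for value in [
    "2021-03-15 13:45:09",       # plain timestamp
    "eta=2021-03-15 13:45:09Z",  # timestamp with surrounding text
    "2021-13-15 13:45:09",       # month 13 is rejected by the pattern
    "2021-02-31 00:00:00",       # shape is valid, calendar date is not
]:
    print(value, contains_timestamp(value), is_exact_timestamp(value))
```

If only whole-value matches are wanted in Spark, anchoring the pattern, e.g. df['eta'].rlike('^' + expression + '$'), should give the stricter behavior. Note that the pattern checks the shape of the value only, so an impossible date such as February 31 still matches.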
