8

I'm splitting an HTTP request to look at the elements, and I was wondering if there was a way to specify the element I'd like to look at in the same call without having to do another operation.

For example:

from pyspark.sql import functions as fn

df.select(fn.split(df.http_request, '/').alias('http'))

gives me a new Dataframe with rows of arrays like this:

+--------------------+
|                http|
+--------------------+
|[, courses, 26420...|

I want the item in index 1 (courses) without having to then do another select statement to specify df.select(df.http[1]) or whatever. Is this possible?

3 Answers

14

Use getItem. I'd say don't use a Python UDF just to make the code look prettier: it's much slower than native DataFrame functions, because data has to move between Python and the JVM.

from pyspark.sql import functions as F
df.select(F.split(df.http_request, '/').alias('http').getItem(1))
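
To see why index 1 is the one that holds `courses`, here is a minimal plain-Python sketch. The request path is hypothetical (the question only shows a truncated value), and it assumes the requests start with a leading `/`, which is what the empty first array element in the question's output suggests.

```python
# Hypothetical request path; the real http_request values are not shown in full.
http_request = "/courses/123/overview"

parts = http_request.split("/")

# The leading "/" produces an empty string at index 0,
# so the first real path segment sits at index 1.
first_segment = parts[1]
```

The same indexing applies to `getItem(1)` on the array produced by `F.split`.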

0

An alternative using selectExpr (this assumes df already has the http array column from the question):

df.selectExpr("http[1] as http_2nd_item")

-1

Well you could define a UDF:

from pyspark.sql.functions import *
from pyspark.sql.types import *

def getter(column, index):
    return column[index]

getterUDF = udf(getter, StringType())

df.select(getterUDF(split(df.http_request, '/').alias('http'), lit(1)))

You could also use the getItem method recommended by @max:

df.select(split(df.http_request, '/').alias('http').getItem(1))

1 Comment

Using a Python UDF will lead to poor performance. The getItem solution is preferred.
