Select array element from Spark Dataframes split method in same call?

Question

I'm splitting an HTTP request to look at the elements, and I was wondering if there was a way to specify the element I'd like to look at in the same call without having to do another operation.

For example:

from pyspark.sql import functions as fn

df.select(fn.split(df.http_request, '/').alias('http'))

gives me a new Dataframe with rows of arrays like this:

+--------------------+
|                http|
+--------------------+
|[, courses, 26420...|

I want the item at index 1 (courses) without needing a second select such as df.select(df.http[1]). Is this possible?

max · Accepted Answer · 2016-07-19 09:10:09Z

14

Use getItem. I'd avoid a Python UDF just to make the code look prettier: it's much slower than native DataFrame functions, because data has to be moved between Python and the JVM.

from pyspark.sql import functions as F
df.select(F.split(df.http_request, '/').alias('http').getItem(1))


datapug · Accepted Answer · 2019-08-08 20:48:22Z

0

An alternative using selectExpr (this assumes df already has the array column http produced by the split):

df.selectExpr("http[1] as http_2nd_item")


Alberto Bonsanto · Accepted Answer · 2019-11-27 20:54:41Z

-1

Well you could define a UDF:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Plain Python function: return the element at the given index of the array
def getter(column, index):
    return column[index]

getterUDF = F.udf(getter, StringType())

df.select(getterUDF(F.split(df.http_request, '/').alias('http'), F.lit(1)))

You could also use the getItem method recommended by @max:

df.select(F.split(df.http_request, '/').alias('http').getItem(1))

edited Nov 27, 2019 at 20:54

answered Jun 7, 2016 at 22:45


1 Comment

franklynd Over a year ago

Using a Python UDF will lead to poor performance. The getItem solution is preferable.
