I have a DataFrame like this:

Studentname  Speciality
Alex         ["Physics","Math","biology"]
Sam          ["Economics","History","Math","Physics"]
Claire       ["Political science", "Physics"]

I want to find all students whose Speciality contains both elements of [Physics, Math], so the output should have 2 rows: Alex and Sam.

This is what I have tried:

from pyspark.sql.functions import array_contains
from pyspark.sql import functions as F

def student_info():
    student_df = spark.read.parquet("s3a://studentdata")
    a1 = ["Physics", "Math"]
    df = student_df
    for a in a1:
        # each iteration filters the original student_df on a single subject,
        # so this prints one count per subject rather than the combined filter
        df = student_df.filter(array_contains(student_df.Speciality, a))
        print(df.count())

student_info()

output:
3
2

I would like to know how to filter an array column based on a given subset of values.
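
For reference, here is a minimal in-memory sketch of the same DataFrame (the spark.createDataFrame call and the literal rows are stand-ins based on the table above, not the actual parquet read):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical stand-in for spark.read.parquet("s3a://studentdata")
df = spark.createDataFrame(
    [
        ("Alex", ["Physics", "Math", "biology"]),
        ("Sam", ["Economics", "History", "Math", "Physics"]),
        ("Claire", ["Political science", "Physics"]),
    ],
    ["Studentname", "Speciality"],
)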

3 Answers

Here is another approach leveraging array_sort and the Spark equality operator, which handles arrays like any other type, with the prerequisite that they are sorted:

from pyspark.sql.functions import lit, array, array_sort, array_intersect

target_ar = ["Physics", "Math"]
search_ar = array_sort(array(*[lit(e) for e in target_ar]))

df.where(array_sort(array_intersect(df["Speciality"], search_ar)) == search_ar) \
  .show(10, False)

# +-----------+-----------------------------------+
# |Studentname|Speciality                         |
# +-----------+-----------------------------------+
# |Alex       |[Physics, Math, biology]           |
# |Sam        |[Economics, History, Math, Physics]|
# +-----------+-----------------------------------+

First we find the common elements with array_intersect(df["Speciality"], search_ar), then we use == to compare the sorted arrays.
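
An equivalent sketch, reusing target_ar and search_ar from the snippet above and assuming target_ar has no duplicates, compares the size of the intersection with the number of required subjects instead of sorting:

from pyspark.sql import functions as F

# array_intersect returns the distinct common elements, so its size equals
# len(target_ar) exactly when every required subject is present
df.where(F.size(F.array_intersect(df["Speciality"], search_ar)) == len(target_ar)) \
  .show(truncate=False)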

Using the higher-order function filter should be the most scalable and efficient way to do this (Spark 2.4+):

from pyspark.sql import functions as F
df.withColumn("new", F.size(F.expr("""filter(Speciality, x-> x=='Math' or x== 'Physics')""")))\
  .filter("new=2").drop("new").show(truncate=False)
+-----------+-----------------------------------+
|Studentname|Speciality                         |
+-----------+-----------------------------------+
|Alex       |[Physics, Math, biology]           |
|Sam        |[Economics, History, Math, Physics]|
+-----------+-----------------------------------+

If you want to use an array like a1 to do this dynamically, you can use F.array_except and F.array and then filter on size (Spark 2.4+):

a1=['Math','Physics']
df.withColumn("array", F.array_except("Speciality",F.array(*(F.lit(x) for x in a1))))\
  .filter("size(array)= size(Speciality)-2").drop("array").show(truncate=False)

+-----------+-----------------------------------+
|Studentname|Speciality                         |
+-----------+-----------------------------------+
|Alex       |[Physics, Math, biology]           |
|Sam        |[Economics, History, Math, Physics]|
+-----------+-----------------------------------+
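
A small variant sketch derives the subtracted count from len(a1) instead of hard-coding 2; it assumes no subject appears twice within a row's Speciality, since array_except drops every occurrence of the removed values:

# same filter as above, with the offset computed from the lookup list
df.withColumn("array", F.array_except("Speciality", F.array(*(F.lit(x) for x in a1))))\
  .filter("size(array) = size(Speciality) - {}".format(len(a1)))\
  .drop("array").show(truncate=False)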

To get the count, you could use .count() instead of .show().
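
For example, a usage sketch of the first snippet that returns the number of matching students instead of displaying them:

# count the students that have both Math and Physics
matches = df.withColumn("new", F.size(F.expr("filter(Speciality, x -> x == 'Math' or x == 'Physics')")))\
  .filter("new = 2").drop("new")
print(matches.count())  # 2 for the sample data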

1 Comment

So we have transform, and also filter (TIL), nice one.

Assuming there are no duplicates in Speciality for a student, e.g. a row like

StudentName   Speciality
SomeStudent   ['Physics', 'Math', 'Biology', 'Physics']

does not occur, you can use explode with groupby in pandas.

So, for your problem

# df is above dataframe
# Lookup subjects
a1 = ['Physics', 'Math']

gdata = df.explode('Speciality').groupby(['Speciality']).size().to_frame('Count')

gdata.loc[a1, 'Count']

#             Count
# Speciality
# Physics         3
# Math            2
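
Note that this gives per-subject counts rather than the matching students; a pandas sketch for the original filtering question (assuming Speciality holds Python lists) could check set containment per row:

# keep rows whose Speciality contains every subject in a1
mask = df['Speciality'].apply(lambda subs: set(a1).issubset(subs))
print(df.loc[mask, 'Studentname'].tolist())  # ['Alex', 'Sam']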

1 Comment

There can be duplicates in the Speciality column.
