I have a DataFrame like this:

Studentname  Speciality
Alex         ["Physics","Math","biology"]
Sam          ["Economics","History","Math","Physics"]
Claire       ["Political science", "Physics"]

I want to find all students whose Speciality contains both elements of [Physics, Math], so the output should have 2 rows: Alex and Sam.

This is what I have tried:

from pyspark.sql.functions import array_contains
from pyspark.sql import functions as F

def student_info():
    student_df = spark.read.parquet("s3a://studentdata")
    a1 = ["Physics", "Math"]
    df = student_df
    for a in a1:
        # each iteration filters the original student_df on a single subject,
        # so this prints one count per subject rather than the combined filter
        df = student_df.filter(array_contains(student_df.Speciality, a))
        print(df.count())

student_info()

output:
3
2

I would like to know how to filter an array column based on a given subset of values.
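
For reference, here is a minimal in-memory sketch of the same DataFrame (the spark.createDataFrame call and the literal rows are stand-ins based on the table above, not the actual parquet read):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical stand-in for spark.read.parquet("s3a://studentdata")
df = spark.createDataFrame(
    [
        ("Alex", ["Physics", "Math", "biology"]),
        ("Sam", ["Economics", "History", "Math", "Physics"]),
        ("Claire", ["Political science", "Physics"]),
    ],
    ["Studentname", "Speciality"],
)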

3 Answers

Here is another approach leveraging array_sort and the Spark equality operator, which handles arrays like any other type, with the prerequisite that they are sorted:

from pyspark.sql.functions import lit, array, array_sort, array_intersect

target_ar = ["Physics", "Math"]
search_ar = array_sort(array(*[lit(e) for e in target_ar]))

df.where(array_sort(array_intersect(df["Speciality"], search_ar)) == search_ar) \
  .show(10, False)

# +-----------+-----------------------------------+
# |Studentname|Speciality                         |
# +-----------+-----------------------------------+
# |Alex       |[Physics, Math, biology]           |
# |Sam        |[Economics, History, Math, Physics]|
# +-----------+-----------------------------------+

First we find the common elements with array_intersect(df["Speciality"], search_ar), then we use == to compare the sorted arrays.
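
An equivalent sketch, reusing target_ar and search_ar from the snippet above and assuming target_ar has no duplicates, compares the size of the intersection with the number of required subjects instead of sorting:

from pyspark.sql import functions as F

# array_intersect returns the distinct common elements, so its size equals
# len(target_ar) exactly when every required subject is present
df.where(F.size(F.array_intersect(df["Speciality"], search_ar)) == len(target_ar)) \
  .show(truncate=False)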

Using the higher-order function filter should be the most scalable and efficient way to do this (Spark 2.4+):

from pyspark.sql import functions as F
df.withColumn("new", F.size(F.expr("""filter(Speciality, x-> x=='Math' or x== 'Physics')""")))\
  .filter("new=2").drop("new").show(truncate=False)
+-----------+-----------------------------------+
|Studentname|Speciality                         |
+-----------+-----------------------------------+
|Alex       |[Physics, Math, biology]           |
|Sam        |[Economics, History, Math, Physics]|
+-----------+-----------------------------------+

If you want to use an array like a1 to do this dynamically, you can use F.array_except and F.array and then filter on size (Spark 2.4+):

a1=['Math','Physics']
df.withColumn("array", F.array_except("Speciality",F.array(*(F.lit(x) for x in a1))))\
  .filter("size(array)= size(Speciality)-2").drop("array").show(truncate=False)

+-----------+-----------------------------------+
|Studentname|Speciality                         |
+-----------+-----------------------------------+
|Alex       |[Physics, Math, biology]           |
|Sam        |[Economics, History, Math, Physics]|
+-----------+-----------------------------------+
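
A small variant sketch derives the subtracted count from len(a1) instead of hard-coding 2; it assumes no subject appears twice within a row's Speciality, since array_except drops every occurrence of the removed values:

# same filter as above, with the offset computed from the lookup list
df.withColumn("array", F.array_except("Speciality", F.array(*(F.lit(x) for x in a1))))\
  .filter("size(array) = size(Speciality) - {}".format(len(a1)))\
  .drop("array").show(truncate=False)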

To get the count, you could use .count() instead of .show().
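
For example, a usage sketch of the first snippet that returns the number of matching students instead of displaying them:

# count the students that have both Math and Physics
matches = df.withColumn("new", F.size(F.expr("filter(Speciality, x -> x == 'Math' or x == 'Physics')")))\
  .filter("new = 2").drop("new")
print(matches.count())  # 2 for the sample data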

1 Comment

So we have transform, and also filter (TIL), nice one.

Assuming there are no duplicates in Speciality for a student, e.g. a row like

StudentName   Speciality
SomeStudent   ['Physics', 'Math', 'Biology', 'Physics']

does not occur, you can use explode with groupby in pandas.

So, for your problem

# df is above dataframe
# Lookup subjects
a1 = ['Physics', 'Math']

gdata = df.explode('Speciality').groupby(['Speciality']).size().to_frame('Count')

gdata.loc[a1, 'Count']

#             Count
# Speciality
# Physics         3
# Math            2
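
Note that this gives per-subject counts rather than the matching students; a pandas sketch for the original filtering question (assuming Speciality holds Python lists) could check set containment per row:

# keep rows whose Speciality contains every subject in a1
mask = df['Speciality'].apply(lambda subs: set(a1).issubset(subs))
print(df.loc[mask, 'Studentname'].tolist())  # ['Alex', 'Sam']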

1 Comment

There can be duplicates in the Speciality column.
