Pass an array into an SQL query using format in pyspark

Question

I would like to run the following query, passing the value of concepts as a parameter to the UDF has_any_concept.

The following is in the environment

concepts

['CREATININE_QUANTITATIVE_24_HOUR_DIALYSIS_FLUID_OBSTYPE',
 'CREATININE_QUANTITATIVE_24_HOUR_URINE_OBSTYPE',
 'CREATININE_QUANTITATIVE_SERUM_OBSTYPE']

This is the query with the values hard-coded, without passing any parameters.

(spark.sql("""
select 
   
   resultCode.standard.primaryDisplay                                           as display
   
   from results 
   WHERE has_any_concept(resultCode, array("CREATININE_QUANTITATIVE_24_HOUR_DIALYSIS_FLUID_OBSTYPE","CREATININE_QUANTITATIVE_24_HOUR_URINE_OBSTYPE","CREATININE_QUANTITATIVE_SERUM_OBSTYPE"))
   
   LIMIT 3
""".format(concepts = concepts))\
   .toPandas()
)

display
0   Creatinine [Mass/volume] in Serum or Plasma
1   Creatinine [Mass/volume] in Serum or Plasma
2   Creatinine [Mass/volume] in Serum or Plasma

This also works

(spark.sql("""
select 
   
   resultCode.standard.primaryDisplay                                           as display,
   ontologicalCategoryAliases                                                   as category
   
   from results 
   WHERE has_any_concept(resultCode, array("{concepts[0]}","{concepts[1]}","{concepts[2]}"))
   
   LIMIT 3
""".format(concepts = concepts))\
   .toPandas()
)

display     category
0   Creatinine [Mass/volume] in Serum or Plasma     [LABS_OBSTYPE]
1   Creatinine [Mass/volume] in Serum or Plasma     [LABS_OBSTYPE]
2   Creatinine [Mass/volume] in Serum or Plasma     [LABS_OBSTYPE]

This does not work

(spark.sql("""
select 
   
   resultCode.standard.primaryDisplay                                           as display,
   ontologicalCategoryAliases                                                   as category
   
   from results 
   WHERE has_any_concept(resultCode, array({concepts}))
   
   LIMIT 3
""".format(concepts = [''' "{concept}"   '''.format(concept = concept) for concept in concepts]))\
   .toPandas()
)

ParseException:
mismatched input 'from' expecting <EOF>(line 7, pos 3)

== SQL ==

select 
   
   resultCode.standard.primaryDisplay                                           as display,
   ontologicalCategoryAliases                                                   as category
   
   from results 
---^^^
   WHERE has_any_concept(resultCode, array([' "CREATININE_QUANTITATIVE_24_HOUR_DIALYSIS_FLUID_OBSTYPE"   ', ' "CREATININE_QUANTITATIVE_24_HOUR_URINE_OBSTYPE"   ', ' "CREATININE_QUANTITATIVE_SERUM_OBSTYPE"   ']))
   AND normalizedValue.typedValue.type = "NUMERIC" 
   AND interpretation.standard.primaryDisplay NOT IN ('Not applicable', 'Normal')
   
   LIMIT 10

I did not write the UDF has_any_concept.

Riley Schack · Accepted Answer · 2021-03-09 05:06:53Z

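You can't pass a Python list directly to the array function in the SQL text: format() inserts the list's string representation, square brackets included, which is not valid SQL. Build a comma-separated string of quoted literals with join instead.

If you're using Python 3.6+, f-strings make the code a little cleaner.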
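
As a quick illustration of the difference, here is a minimal sketch in plain Python (no Spark needed). The two concept names are shortened stand-ins, not the real values from the question.

```python
# Shortened stand-ins for the real concept names (assumption for illustration).
concepts = ["A_OBSTYPE", "B_OBSTYPE"]

# Formatting the list itself inserts its Python repr, brackets included.
broken = "array({concepts})".format(concepts=concepts)
print(broken)  # array(['A_OBSTYPE', 'B_OBSTYPE'])

# Joining the quoted elements first yields a plain comma-separated argument list.
joined = ", ".join(f"'{c}'" for c in concepts)
fixed = "array({concepts})".format(concepts=joined)
print(fixed)  # array('A_OBSTYPE', 'B_OBSTYPE')
```

Applied to the query from the question, the joined form looks like this: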

spark.sql(
    f"""
    select 
       resultCode.standard.primaryDisplay as display,
       ontologicalCategoryAliases as category
    from results 
    WHERE has_any_concept(resultCode, array({", ".join([f"'{x}'" for x in concepts])}))
    LIMIT 3
    """
).toPandas()
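
If the values might ever contain a quote character, it may be worth moving the quoting into a small helper. The sketch below is one possible way to do that; sql_array_literal is a hypothetical name, and the escaping assumes Spark's default behaviour of treating a backslash as the escape character inside string literals.

```python
def sql_array_literal(values):
    """Render a list of strings as a Spark SQL array(...) expression.

    Hypothetical helper. Assumes Spark's default string-literal parsing,
    where a backslash escapes the following character.
    """
    quoted = []
    for value in values:
        # Escape backslashes first, then single quotes.
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        quoted.append(f"'{escaped}'")
    return "array(" + ", ".join(quoted) + ")"


concepts = [
    "CREATININE_QUANTITATIVE_24_HOUR_DIALYSIS_FLUID_OBSTYPE",
    "CREATININE_QUANTITATIVE_24_HOUR_URINE_OBSTYPE",
    "CREATININE_QUANTITATIVE_SERUM_OBSTYPE",
]

# Build the SQL text; pass it to spark.sql(query) in a Spark session.
query = f"""
select
   resultCode.standard.primaryDisplay as display,
   ontologicalCategoryAliases as category
from results
WHERE has_any_concept(resultCode, {sql_array_literal(concepts)})
LIMIT 3
"""
print(query)
```

Keeping the string building separate from the spark.sql call also makes it easy to print and inspect the generated SQL before running it.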
