
I need to make the following report scalable:

query = """
(SELECT
    '02/11/2019' as Week_of,
    media_type,
    campaign,
    count(ad_start_ts) as frequency
FROM usotomayor.digital 
WHERE ds between 20190211 and 20190217
GROUP BY 1,2,3)
UNION ALL
(SELECT
    '02/18/2019' as Week_of,
    media_type,
    campaign,
    count(ad_start_ts) as frequency
FROM usotomayor.digital 
WHERE ds between 20190211 and 20190224
GROUP BY 1,2,3)
"""

#Converting to dataframe
query2 = spark.sql(query).toPandas()
query2

However, as you can see, this approach doesn't scale: for a long list of dates I would have to write and union a separate SELECT for each one.

My first attempt at looping a list of date variables into the SQL script is as follows:

dfys = ['20190217','20190224']

df2 = ['02/11/2019','02/18/2019']

for i in df2:
    date=i

for j in dfys:
    date2=j

query = f"""
SELECT
    '{date}' as Week_of,
    raw.media_type,
    raw.campaign,
    count(raw.ad_start_ts) as frequency
FROM usotomayor.digital raw 
WHERE raw.ds between 20190211 and {date2}
GROUP BY 1,2,3

"""

#Converting to dataframe
query2 = spark.sql(query).toPandas()
query2

However, this is not working for me: it only produces results for the last pair of dates. I think I need to loop through the SQL query itself, but I don't know how to do this. Can someone help me?
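
For reference, checking the loop variables confirms what I'm seeing: the query is built after both loops finish, so date and date2 only ever hold the last elements of their lists:

df2 = ['02/11/2019', '02/18/2019']
dfys = ['20190217', '20190224']

for i in df2:
    date = i
for j in dfys:
    date2 = j

# The loop bodies only rebind the names; after the loops, just the
# final elements remain, so the f-string query is built only once.
print(date, date2)  # 02/18/2019 20190224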

  • "this is not working for me" — Perhaps you could be more specific. Commented Jul 19, 2019 at 20:07
  • When I run the code with the Python loops, it only runs the query for the last values in the lists. Commented Jul 19, 2019 at 20:27
  • The indentation is incorrect, so I can only say you have to build the query inside the for-loop, not after it. You could create a list with all the SELECTs and later concatenate them with "\nUNION ALL\n".join(list_with_all_SELECT); see the sketch after these comments. Commented Jul 19, 2019 at 20:40
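
A minimal sketch of that last suggestion, assuming the same spark session, table, and fixed 20190211 lower bound as in the question:

dfys = ['20190217', '20190224']
df2 = ['02/11/2019', '02/18/2019']

# Build one parenthesised SELECT per (week label, end date) pair,
# then splice them together into a single UNION ALL query.
selects = [
    f"""(SELECT
    '{week}' as Week_of,
    media_type,
    campaign,
    count(ad_start_ts) as frequency
FROM usotomayor.digital
WHERE ds between 20190211 and {end_ds}
GROUP BY 1,2,3)"""
    for week, end_ds in zip(df2, dfys)
]

query = "\nUNION ALL\n".join(selects)
result = spark.sql(query).toPandas()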

1 Answer


As a commenter said, "this is not working for me" is not very specific, so let's start by specifying the problem: you need to execute a query for each pair of dates. That means running the queries in a loop and saving each result (or actually unioning them, but then you would need to change your query logic).

You could do it like this:

dfys = ['20190217', '20190224']

df2 = ['02/11/2019', '02/18/2019']

query_results = list()
# Pair each week label (df2) with its end date (dfys); the two lists
# line up by index, so zip() walks them together.
for week_of, end_ds in zip(df2, dfys):
    query = f"""
    SELECT
        '{week_of}' as Week_of,
        raw.media_type,
        raw.campaign,
        count(raw.ad_start_ts) as frequency
    FROM usotomayor.digital raw
    WHERE raw.ds between 20190211 and {end_ds}
    GROUP BY 1,2,3
    """
    query_results.append(spark.sql(query).toPandas())

query_results[0]
query_results[1]

Now you get a list of your results (query_results), with one DataFrame per week.
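
If you would rather end up with a single DataFrame than a list, you can concatenate the pieces afterwards; a minimal sketch using pandas, assuming the loop above has already populated query_results:

import pandas as pd

# Stack the per-week DataFrames into one report; each row already
# carries its Week_of label, so no extra key column is needed.
report = pd.concat(query_results, ignore_index=True)
report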


1 Comment

Thank you SO much for this answer! After more than a dozen searches, I finally stumbled upon it and it was exactly what I needed! You’re so helpful and I appreciate your clear, correct, and applicable answer!
