
I have several SQL queries written for MS SQL Server, and I use the following code to import their results into Python using the pyodbc package.

import pyodbc
import pandas as pd 

def conn_sql_server(file_path):
    '''Connect to SQL Server and save the query result to a dataframe.
        input:
            file_path - query file path
        output:
            df - dataframe from the query result
    '''

    # Connect to SQL Server
    conn = pyodbc.connect('Driver={SQL Server Native Client 11.0};'
                          'Server=servername;'
                          'Database=databasename;'
                          'Trusted_Connection=yes;')

    # Run the query and output the result to df
    with open(file_path, 'r') as query:
        df = pd.read_sql_query(query.read(), conn)

    return df

df1 = conn_sql_server('C:/Users/JJ/SQL script1')
df2 = conn_sql_server('C:/Users/JJ/SQL script2')
df3 = conn_sql_server('C:/Users/JJ/SQL script3')

In each SQL query, I have used DECLARE and SET to set the variables (variables are different in each SQL query). Here, I just copied a random query from online as an example. What I want to do is to update the Year variable directly in Python. My actual query is pretty long, so I don't want to copy over the SQL scripts in Python, I just want to adjust the variables. Any way to do it?

DECLARE @Year INT = 2022;
SELECT YEAR(date) @Year, 
       SUM(list_price * quantity) gross_sales
FROM sales.orders o
     INNER JOIN sales.order_items i ON i.order_id = o.order_id
GROUP BY YEAR(date)
order by @Year

My other question is: is there any way to add a WHERE clause, like WHERE itemNumber = 1002345, after importing the above query into Python? I'm asking this because df2 is a subset of df1. The restriction column isn't selected in the output, so I cannot do the filtering in Python after reading in df1. I could add that column to the df1 output and do more aggregation in Python, but that would greatly increase the data size and running time, so I prefer not to.

  • You are reading the entire file such as filepath1 into memory using Python. You can modify the declare/set lines to your liking using Python. Similarly, you can add a where clause in Python as well. If you edit your question and add a sample of filepath1 or filepath2 and how you want the query to be changed, I can try to assist. Commented May 2, 2022 at 4:46
  • Thank you, @zedfoxus! I've edited my post with more details. Commented May 3, 2022 at 5:07
  • Remove the DECLARE and use a parameter for @Year? (I don't know how to do parameterized statements in pd, but doubtless it's possible.) Similarly, for dynamic WHEREs use something like WHERE @itemNumber IS NULL OR itemNumber = @itemNumber, then set the @itemNumber parameter as appropriate. OPTION (RECOMPILE) comes in handy there to keep query plans that perform well. If at all possible you should avoid tampering with the query text, as that's much more error prone. Commented May 3, 2022 at 6:30
  • You can't update variables in Python or SQL outside their method/scope. You use parameters for this. read_sql has a params argument that's used to pass parameter values. pyodbc doesn't support named parameters, so you'll have to use anonymous ones. If your query is e.g. select * from SomeTable where ID=? you could use .read_sql(query, params=(123,)). Commented May 3, 2022 at 12:34
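To illustrate the parameterized approach the last two comments describe, here's a minimal sketch. It uses an in-memory SQLite database as a stand-in for SQL Server (the table and values are made up for the demo), since the `?` placeholder style is the same one pyodbc uses:

```python
import sqlite3
import pandas as pd

# Stand-in for the pyodbc connection; pyodbc also uses '?' placeholders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (item_number INTEGER, gross_sales REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1002345, 10.0), (1002345, 20.0), (9999, 5.0)])

# The filter value is passed via params, never interpolated into the SQL text.
query = "SELECT SUM(gross_sales) AS total FROM orders WHERE item_number = ?"
df = pd.read_sql_query(query, conn, params=(1002345,))
print(df["total"].iloc[0])  # 30.0
```

The same pattern works against SQL Server: remove the DECLARE line from the script, replace each variable reference with `?`, and pass the values through `params`.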

1 Answer


Here's a sample of how your script will look. We are making two modifications:

  • conn_sql_server now takes these parameters:

    • year: the year to substitute into the DECLARE @Year line
    • where_clause: a WHERE clause of your choice
    • before_clause_starts_with: the clause before which the WHERE clause should be placed
  • a modify_query function that reads the contents of the file and rewrites them based on the year you provide. If you supply a WHERE clause, it is inserted before the clause named in before_clause_starts_with

import pyodbc
import pandas as pd 

def modify_query(lines, year, where_clause, before_clause_starts_with):
    new_lines = []

    for line in lines:

        if year is not None:
            if line.lower().startswith('declare @year int ='):
                new_lines.append(f"DECLARE @Year INT = {year}\n")
                continue

        if where_clause is not None:
            if line.lower().startswith(before_clause_starts_with.lower()):
                new_lines.append(where_clause + "\n")
                new_lines.append(line)
                continue

        new_lines.append(line)

    new_query = ''.join(new_lines)
    return new_query


def conn_sql_server(file_path, year=None, where_clause=None, before_clause_starts_with=None):
    '''Connect to SQL Server and save the query result to a dataframe.
        input:
            file_path - query file path
            year - optional year to substitute into the DECLARE @Year line
            where_clause - optional WHERE clause to insert
            before_clause_starts_with - clause before which the WHERE clause goes
        output:
            df - dataframe from the query result
    '''

    # Connect to SQL Server
    conn = pyodbc.connect('Driver={SQL Server Native Client 11.0};'
                          'Server=servername;'
                          'Database=databasename;'
                          'Trusted_Connection=yes;')

    # Read the query file, modify it, and run it
    with open(file_path, 'r') as query:
        lines = query.readlines()

    new_query = modify_query(lines, year, where_clause, before_clause_starts_with)

    df = pd.read_sql_query(new_query, conn)
    return df

df1 = conn_sql_server('C:/Users/JJ/SQL script1',
                      year=1999,
                      where_clause='WHERE itemNumber = 1002345',
                      before_clause_starts_with='group by')

df2 = conn_sql_server('C:/Users/JJ/SQL script2')

df3 = conn_sql_server('C:/Users/JJ/SQL script3', year=1500)

Simulation

Let's run an example.

script1.sql

DECLARE @Year INT = 2022;
SELECT YEAR(date) @Year, 
       SUM(list_price * quantity) gross_sales
FROM sales.orders o
     INNER JOIN sales.order_items i ON i.order_id = o.order_id
GROUP BY YEAR(date)
order by @Year

script2.sql

DECLARE @Year INT = 2022;
SELECT gross_sales
FROM sales.orders
order by @Year

script3.sql

DECLARE @Year INT = 2022;
SELECT GETDATE()

Using a script similar to the above, we'll see what each script looks like after it gets modified.

Simulation script

#import pyodbc
#import pandas as pd 

def modify_query(lines, year, where_clause, before_clause_starts_with):
    new_lines = []

    print('-------')
    print('ORIGINAL')
    print('-------')
    print(lines)

    for line in lines:

        if year is not None:
            if line.lower().startswith('declare @year int ='):
                new_lines.append(f"DECLARE @Year INT = {year}\n")
                continue

        if where_clause is not None:
            if line.lower().startswith(before_clause_starts_with.lower()):
                new_lines.append(where_clause + "\n")
                new_lines.append(line)
                continue

        new_lines.append(line)


    print('-------')
    print('NEW')
    print('-------')
    new_query = ''.join(new_lines)
    print(new_query)

    return new_query


def conn_sql_server(file_path, year=None, where_clause=None, before_clause_starts_with=None):
    '''Function to connect to SQL Server and save query result to a dataframe
        input:
            file_path - query file path
        output:
        df - dataframe from the query result
    '''

    # Connect to SQL Server
    #conn = pyodbc.connect('Driver= {SQL Server Native Client 11.0};'
    #                  'Server= servername;'
    #                  'Database = databasename;'
    #                  'Trusted_Connection=yes;')

    # Read the query file and modify it
    query = open(file_path, 'r')
    lines = query.readlines()
    query.close()

    new_query = modify_query(lines, year, where_clause, before_clause_starts_with)

    #df = pd.read_sql_query(new_query, conn)
    #return df   

#df1 = conn_sql_server('C:/Users/JJ/SQL script1')
#df2 = conn_sql_server('C:/Users/JJ/SQL script2')
#df3 = conn_sql_server('C:/Users/JJ/SQL script3')

df1 = conn_sql_server('script1.sql', year=1999, where_clause='WHERE itemNumber = 1002345', before_clause_starts_with='group by')
df2 = conn_sql_server('script2.sql')
df3 = conn_sql_server('script3.sql', year=1500)

The original query 1 looked like this in script1.sql:

['DECLARE @Year INT = 2022;\n', 'SELECT YEAR(date) @Year, \n', '       SUM(list_price * quantity) gross_sales\n', 'FROM sales.orders o\n', '     INNER JOIN sales.order_items i ON i.order_id = o.order_id\n', 'GROUP BY YEAR(date)\n', 'order by @Year']

After running the script, the query will become

DECLARE @Year INT = 1999
SELECT YEAR(date) @Year, 
       SUM(list_price * quantity) gross_sales
FROM sales.orders o
     INNER JOIN sales.order_items i ON i.order_id = o.order_id
WHERE itemNumber = 1002345
GROUP BY YEAR(date)
order by @Year

Query 3 used to look like this:

['DECLARE @Year INT = 2022;\n', 'SELECT GETDATE()']

It becomes

DECLARE @Year INT = 1500
SELECT GETDATE()

Give it a shot by changing the Python script as you see fit.


Comments

  • It's worth pointing out that string interpolation approaches like this are vulnerable to SQL injection, so you have to be sure the input comes from a trusted source. As written, it's trivial to pass a year of 0; DROP TABLE sales.orders;-- and potentially end up with a really bad day.
  • @zedfoxus Thanks for your answer. However, as I said, it already takes a long time to run the SQL script for df1, so using a for loop would greatly increase the time. Plus, I also mentioned that I have different variables to change for different queries. For df1 and df2, I am changing year, but df3 is a completely different query and I am changing two other variables. I am looking for a more feasible answer that can be used with any kind of query without increasing the running time.
  • @Jiamei Without knowing what your queries are, the table structure, why they are slow, the indexes on the tables, etc., it's going to be difficult to suggest solutions that will improve performance. Does sales.orders have an index on the date field? If you intend to add itemNumber in the WHERE clause, you might benefit from an index on date and itemNumber. Once you do that, see how your queries run.
  • Thanks for your reply. For confidentiality reasons, I cannot share the original code here, so the SQL query I showed is just a random query I copied from the web, but it should be the same idea. The query is already optimized; it's just the size of the data that causes the long running time.
  • Without sufficient data, I doubt anyone will be able to help you speed the queries up. You might have to work with your DBA or equivalent to optimize them. SQL Server is very powerful; you can tune queries to run very fast even with large datasets. The code above is an example of how you can change existing queries in SQL files before they are executed, using the information available in the question.
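Regarding the injection risk raised in the first comment: if the query text must be built by interpolation, one small safeguard is to validate the value before it ever reaches the SQL string. This is a sketch, not a full defense; the 1900-2100 range check is an illustrative assumption:

```python
def safe_year(value):
    """Coerce the input to int so a string like '0; DROP TABLE ...'
    cannot slip into the interpolated query text."""
    year = int(value)  # raises ValueError for non-numeric input
    if not 1900 <= year <= 2100:
        raise ValueError(f"year out of range: {year}")
    return year

print(safe_year("1999"))  # 1999
```

A call like safe_year("0; DROP TABLE sales.orders;--") raises ValueError instead of passing the payload through. True parameterization (the `params` argument) remains the safer option wherever the value can be a query parameter.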
