
My code looks like this. I use pd.DataFrame.from_records to fill the data into the DataFrame, but it takes Wall time: 1h 40min 30s to run the query and load the data from a SQL table with 22 million rows into the DataFrame.

# I skipped some of the code; the query itself executes quickly, so the problem is not there
cur = con.cursor()

def db_select(query):  # takes the query text and returns a DataFrame
    cur.execute(query)
    col = [column[0].lower() for column in cur.description]  # column names from the cursor description
    df = pd.DataFrame.from_records(cur, columns=col)  # fill the data into the DataFrame
    return df

Then I pass the sql query to the function:

frame = db_select("select * from table")

How can I optimize this code to speed up the process?

5 Comments

  • That sounds like a lot of data to process (22M rows). You've got several things working against you: you're selecting all of the data, so there are no indexes in play to speed up your query. A full table scan (and all the server I/O that goes with it) will be the likely result. Then you have to push all of that over the network and cache it (possibly more than once) in the application. That's a lot of context switches and (I would guess) memory, and possibly even swap I/O involved. What resource bottlenecks have you observed on the servers or the network? Commented Dec 6, 2020 at 1:12
  • Have you tried pd.read_sql? (See the sketch after these comments.) Commented Dec 6, 2020 at 1:15
  • I'm not familiar with Oracle, but I recall connecting to Postgres was slow; I think you can end up generating a new connection and writing each row individually if you're not careful. Commented Dec 6, 2020 at 1:17
  • Also, 22M is a lot of rows :) Commented Dec 6, 2020 at 1:17
  • You could try dd.read_sql_table() as in dask (pandas' big-data big brother) instead of pandas: pip install dask and import dask.dataframe as dd. Commented Dec 6, 2020 at 5:49

1 Answer


Setting a proper value for cur.arraysize might help tune fetch performance. You need to determine the most suitable value for it; the default is 100. Code like the following can be run with different array sizes in order to determine that value:

from datetime import datetime

# Time a full fetch of the table for a range of fetch array sizes
arr = [100, 1000, 10000, 100000, 1000000]
for size in arr:
    try:
        cur.prefetchrows = 0   # disable prefetching so only arraysize is measured
        cur.arraysize = size   # rows fetched from the database per round trip
        start = datetime.now()
        cur.execute("SELECT * FROM mytable").fetchall()
        elapsed = datetime.now() - start
        print("Process duration for arraysize", size, "is", elapsed)
    except Exception as err:
        print("Memory error", err, "for arraysize", size)

and then set, for example, cur.arraysize = 10000 before calling db_select in your original code.
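
Applied to the db_select function from the question, that might look like the sketch below; 10000 is only the example value from the paragraph above, so use whatever your own timing run favours.

cur = con.cursor()
cur.arraysize = 10000  # rows fetched per round trip; tune with the timing loop above

def db_select(query):
    cur.execute(query)
    col = [column[0].lower() for column in cur.description]
    df = pd.DataFrame.from_records(cur, columns=col)
    return df

frame = db_select("select * from table")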


2 Comments

Testing with a DataFrame of a much smaller shape (4816566, 6) [rows, columns] showed the following results:
  • cur.arraysize left at the default: Wall time: 26min 59s
  • cur.arraysize = 1000: Wall time: 7min 49s
  • cur.arraysize = 10000: Wall time: 4min 42s
  • cur.arraysize = 100000: Wall time: 4min 32s
  • cur.arraysize = 1000000: Wall time: 4min 33s
A significant speed boost is evident. Thanks a lot for your help!
In my case, setting prefetchrows as high as possible gives the best performance. I have about 2M rows and 30 columns, and it took 1 minute with prefetchrows=2000000 and arraysize=1000.
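
A rough sketch of that variant, assuming the same cursor-based db_select as in the question; the two values are the ones reported in this comment, not general recommendations.

cur = con.cursor()
cur.prefetchrows = 2000000  # rows prefetched together with the execute round trip
cur.arraysize = 1000        # rows fetched per subsequent round trip

frame = db_select("select * from table")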
