
My code looks like this. I use pd.DataFrame.from_records to fill the data into the DataFrame, but it takes Wall time: 1h 40min 30s to run the query and load the data from a SQL table with 22 million rows into the DataFrame.

# I skipped some of the code; the query itself executes quickly, so the problem is not there
cur = con.cursor()

def db_select(query):  # takes the query text and returns a DataFrame
    cur.execute(query)
    col = [column[0].lower() for column in cur.description]  # column names from the cursor description
    df = pd.DataFrame.from_records(cur, columns=col)  # fill the data into the DataFrame
    return df

Then I pass the sql query to the function:

frame = db_select("select * from table")

How can I optimize this code to speed up the process?

5 Comments

  • That sounds like a lot of data to process (22M rows). You've got several things working against you: you're selecting all of the data, so there are no indexes in play to speed up your query. A full table scan (and all the server I/O that goes with it) will be the likely result. Then you have to push all of that over the network and cache it (possibly more than once) in the application. That's a lot of context switches and (I would guess) memory, and possibly even swap I/O involved. What resource bottlenecks have you observed on the servers or the network? Commented Dec 6, 2020 at 1:12
  • Have you tried pd.read_sql? (See the sketch after these comments.) Commented Dec 6, 2020 at 1:15
  • I'm not familiar with Oracle, but I recall connecting to Postgres was slow; I think you can end up generating a new connection and writing each row individually if you're not careful. Commented Dec 6, 2020 at 1:17
  • Also, 22M is a lot of rows :) Commented Dec 6, 2020 at 1:17
  • You could try dd.read_sql_table() as in dask (pandas' big-data big brother) instead of pandas: pip install dask and import dask.dataframe as dd. Commented Dec 6, 2020 at 5:49

1 Answer


Setting a proper value for cur.arraysize might help tune fetch performance. You need to determine the most suitable value for it; the default is 100. Code like the following can be run with different array sizes in order to determine that value:

from datetime import datetime

# Time a full fetch of the table for a range of fetch array sizes
arr = [100, 1000, 10000, 100000, 1000000]
for size in arr:
    try:
        cur.prefetchrows = 0   # disable prefetching so only arraysize is measured
        cur.arraysize = size   # rows fetched from the database per round trip
        start = datetime.now()
        cur.execute("SELECT * FROM mytable").fetchall()
        elapsed = datetime.now() - start
        print("Process duration for arraysize", size, "is", elapsed)
    except Exception as err:
        print("Memory error", err, "for arraysize", size)

and then set, for example, cur.arraysize = 10000 before calling db_select in your original code.
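
Applied to the db_select function from the question, that might look like the sketch below; 10000 is only the example value from the paragraph above, so use whatever your own timing run favours.

cur = con.cursor()
cur.arraysize = 10000  # rows fetched per round trip; tune with the timing loop above

def db_select(query):
    cur.execute(query)
    col = [column[0].lower() for column in cur.description]
    df = pd.DataFrame.from_records(cur, columns=col)
    return df

frame = db_select("select * from table")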


2 Comments

Testing with a DataFrame of a much smaller shape (4816566, 6) [rows, columns] showed the following results:
  • cur.arraysize left at the default: Wall time: 26min 59s
  • cur.arraysize = 1000: Wall time: 7min 49s
  • cur.arraysize = 10000: Wall time: 4min 42s
  • cur.arraysize = 100000: Wall time: 4min 32s
  • cur.arraysize = 1000000: Wall time: 4min 33s
A significant speed boost is evident. Thanks a lot for your help!
In my case, setting prefetchrows as high as possible gives the best performance. I have about 2M rows and 30 columns, and it took 1 minute with prefetchrows=2000000 and arraysize=1000.
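
A rough sketch of that variant, assuming the same cursor-based db_select as in the question; the two values are the ones reported in this comment, not general recommendations.

cur = con.cursor()
cur.prefetchrows = 2000000  # rows prefetched together with the execute round trip
cur.arraysize = 1000        # rows fetched per subsequent round trip

frame = db_select("select * from table")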
