
This is probably a highly discussed topic, but I have not found "the answer" yet. I insert big tables into Azure SQL Server monthly. I process the raw data in memory with Python and pandas, and I really like the speed and versatility of pandas.

Sample DataFrame size: 5.2 million rows, 50 columns, ~250 MB of memory allocated.

Transferring the processed DataFrame to Azure SQL Server is always the bottleneck. For the transfer I used to_sql (with SQLAlchemy) and tried fast_executemany, various chunksize values, and other arguments.
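Roughly what I have been doing looks like the sketch below (server, database, credentials and table name are placeholders; df is the DataFrame described above):

```python
import urllib
from sqlalchemy import create_engine

# Placeholder connection string for an Azure SQL database
params = urllib.parse.quote_plus(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;"
    "DATABASE=mydb;UID=myuser;PWD=mypassword"
)

engine = create_engine(
    f"mssql+pyodbc:///?odbc_connect={params}",
    fast_executemany=True,  # batch the parameterized INSERTs instead of one round trip per row
)

# df is the 5.2M-row DataFrame already built in memory
df.to_sql("my_table", engine, if_exists="append", index=False, chunksize=10_000)
```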

The fastest way I have found so far is to export the DataFrame to a CSV file and then BULK INSERT that into SQL Server using SSMS, bcp, Azure Blob Storage, etc.
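Driving that route from Python looks roughly like this (all names are placeholders; on Azure SQL Database the CSV typically has to be staged in Blob Storage behind an external data source, called MyAzureBlob here, which would need to exist already):

```python
import pyodbc

# Dump the DataFrame to a flat file, then upload it to the blob container
# backing the 'MyAzureBlob' external data source (upload step not shown).
df.to_csv("my_table.csv", index=False, header=False)

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;"
    "DATABASE=mydb;UID=myuser;PWD=mypassword",
    autocommit=True,
)
conn.execute(r"""
    BULK INSERT dbo.my_table
    FROM 'my_table.csv'
    WITH (DATA_SOURCE = 'MyAzureBlob',
          FORMAT = 'CSV',
          FIELDTERMINATOR = ',',
          ROWTERMINATOR = '\n',
          TABLOCK);
""")
conn.close()
```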

However, I am looking into bypassing the CSV file creation, since my DataFrame already has all the dtypes set and is loaded in memory.

What is your fastest way of transferring this DataFrame to SQL Server using only Python/pandas? I am also interested in approaches such as binary file transfer, as long as the flat-file export/import step is eliminated.

Thanks

1 Answer


I had a similar issue, and I resolved it with a BCP-based utility. The basic bottleneck is that to_sql seems to do RBAR ("Row-By-Agonizing-Row") data entry, i.e. one INSERT statement per record. Going the bulk-insert route saved me a lot of time. The real benefit seemed to come once I crossed the threshold of 1M+ records, which you are well ahead of.

Link to the utility: https://github.com/yehoshuadimarsky/bcpandas
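A minimal usage sketch (connection details and table name are placeholders; the bcp command-line utility must be installed and on the PATH, and as far as I understand bcpandas stages a temporary CSV internally and pushes it with bcp):

```python
from bcpandas import SqlCreds, to_sql

# Placeholder credentials for the target Azure SQL database
creds = SqlCreds(
    server="myserver.database.windows.net",
    database="mydb",
    username="myuser",
    password="mypassword",
)

# Hands the DataFrame off to bcp for a bulk load,
# so the per-row INSERT overhead disappears.
to_sql(df, "my_table", creds, index=False, if_exists="append")
```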


1 Comment

5204251 rows copied. Network packet size (bytes): 4096. Clock time (ms.) total: 341391, average: 15244.25 rows per sec. The printout and ease of use are added benefits!
