
I am trying to insert data from an xlsx file into a MySQL table. If a row collides with an existing primary key I want to update the existing row, otherwise insert a new one. I have already written a script for this, but I realized it is a lot of work, and with pandas it is quick. How can I achieve this in pandas?

#!/usr/bin/env python3

import pandas as pd
import sqlalchemy

engine_str = 'mysql+pymysql://admin:mypass@localhost/mydb'
# note: the encoding argument was removed in SQLAlchemy 2.0
engine = sqlalchemy.create_engine(engine_str, echo=False, encoding='utf-8')

file_name = "tmp/results.xlsx"
df = pd.read_excel(file_name)

3 Answers


I can think of two options, but number 1 might be cleaner/faster:

1) Make SQL decide on the update/insert. Check this other question. You can iterate over the rows of your df, from i = 0 to n - 1, and inside the loop write something like:

query = """INSERT INTO table (id, name, age) VALUES(%s, %s, %s)
ON DUPLICATE KEY UPDATE name=%s, age=%s"""
engine.execute(query, (df.id[i], df.name[i], df.age[i], df.name[i], df.age[i]))
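
To make option 1 concrete, here is a minimal end-to-end sketch; the table name your_table and the columns id, name, age are assumptions carried over from the snippet above:

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine('mysql+pymysql://admin:mypass@localhost/mydb')
df = pd.read_excel("tmp/results.xlsx")

query = """INSERT INTO your_table (id, name, age) VALUES (%s, %s, %s)
           ON DUPLICATE KEY UPDATE name = %s, age = %s"""

# itertuples() is faster than iterrows() and gives attribute access per row
for row in df.itertuples(index=False):
    engine.execute(query, (row.id, row.name, row.age, row.name, row.age))

Because the values travel as bound parameters, the driver escapes them for you; no string building is needed.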

2) Define a Python function that returns True when the record exists and False otherwise, then use it in your loop:

def check_existence(user_id):
    # SELECT EXISTS(...) returns 1 if at least one matching row is found
    query = "SELECT EXISTS (SELECT 1 FROM your_table WHERE user_id_str = %s);"
    return list(engine.execute(query, (user_id,)))[0][0] == 1

You could iterate over the rows and run this check before each insert; a rough sketch follows.
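
A minimal sketch of that loop, assuming the same your_table keyed by user_id_str with name and age columns (all names illustrative):

for row in df.itertuples(index=False):
    if check_existence(row.user_id_str):
        # record exists: update it in place
        engine.execute(
            "UPDATE your_table SET name = %s, age = %s WHERE user_id_str = %s",
            (row.name, row.age, row.user_id_str))
    else:
        # record is new: insert it
        engine.execute(
            "INSERT INTO your_table (user_id_str, name, age) VALUES (%s, %s, %s)",
            (row.user_id_str, row.name, row.age))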

Please also check the solutions in this question and this one, which might work in your case.


3 Comments

The reason I am trying to avoid raw SQL is that the Excel files I receive often contain column values that have to be escaped before I can build a query and insert into the database. Since pandas handles column values very efficiently, I was hoping there was a way to use pandas to insert the data directly rather than executing an SQL query, which would let me bypass escaping random values (or garbage data).
@r0xette I don't get why you can't use an SQL query. Are you sure you can't change the column names or something? In any case, maybe using sqlalchemy will let you avoid explicit SQL string queries. AFAIK, pandas doesn't have a native way of choosing between inserting and updating rows.
I am using an SQL query now, escaping the double quotes etc. as they come. I was hoping there was a way to perform the upsert natively so the bad data would be escaped automatically.

Pangres is the tool for this job.

Overview here: https://pypi.org/project/pangres/

Use the function pangres.fix_psycopg2_bad_cols to "clean" the columns in the DataFrame.

Code/usage here: https://github.com/ThibTrip/pangres/wiki and https://github.com/ThibTrip/pangres/wiki/Fix-bad-column-names-postgres. Example code:

# From: https://github.com/ThibTrip/pangres/wiki/Fix-bad-column-names-postgres
import pandas as pd
from pangres import fix_psycopg2_bad_cols

# fix bad col/index names with the default replacements
# (empty string for '(', ')' and '%')
df = pd.DataFrame({'test()': [0],
                   'foo()%': [0]}).set_index('test()')
print(df)
#         foo()%
# test()
# 0            0

# clean cols/index with the default replacements
df_fixed = fix_psycopg2_bad_cols(df)
print(df_fixed)
#       foo
# test
# 0       0

# fix bad col/index names with custom replacements - you MUST provide
# replacements for '(', ')' and '%'

# reset df
df = pd.DataFrame({'test()': [0],
                   'foo()%': [0]}).set_index('test()')

# clean cols/index with user-specified replacements
df_fixed = fix_psycopg2_bad_cols(df, replacements={'%': 'percent', '(': '', ')': ''})
print(df_fixed)
#       foopercent
# test
# 0              0

It will only fix some of the bad characters: it replaces '%', '(' and ')' (characters that won't play nicely, or at all). But it is useful in that it handles both the cleanup and the upsert.
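
For the upsert itself, a minimal sketch, assuming a table named my_table keyed by the DataFrame's index (the table name and connection string are illustrative, and keyword names have shifted slightly across pangres versions; see the wiki for the current API):

import pandas as pd
from pangres import upsert
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://admin:mypass@localhost/mydb')
# pangres uses the DataFrame index as the primary key
df = pd.read_excel("tmp/results.xlsx").set_index('id')

# if_row_exists='update' overwrites colliding rows; 'ignore' skips them
upsert(con=engine, df=df, table_name='my_table', if_row_exists='update')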

(p.s., I know this post is over 4 years old, but still shows up in Google results when searching for "pangres upsert determine number inserts and updates" as the top SO result, dated May 13, 2020.)

2 Comments

For the code/usage section, could you provide examples from that link in the answer itself? Links are unreliable and may break, but if you provide the relevant pieces of the linked examples, this answer will stand the test of time a bit better.
@gallen - code added from source/link as requested

When using pandas, no explicit iteration is needed. Isn't that faster?

df = pd.read_csv(csv_file, sep=';', names=['column'])

df.to_sql('table', con=engine, if_exists='append', index=False, chunksize=20000)
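
Note that to_sql with if_exists='append' only inserts, so it fails on duplicate keys. You can keep the no-explicit-loop approach and still upsert by passing a callable as to_sql's method argument; a rough sketch for MySQL (the mysql_upsert name and the choice to update every column are my assumptions):

from sqlalchemy.dialects.mysql import insert

def mysql_upsert(table, conn, keys, data_iter):
    # pandas calls this once per chunk with the target SQLTable, an open
    # connection, the column names, and an iterator of row tuples
    data = [dict(zip(keys, row)) for row in data_iter]
    stmt = insert(table.table).values(data)
    # on a key collision, update every column with the value that would
    # have been inserted
    stmt = stmt.on_duplicate_key_update(
        {c.name: stmt.inserted[c.name] for c in table.table.columns})
    conn.execute(stmt)

df.to_sql('table', con=engine, if_exists='append', index=False,
          chunksize=20000, method=mysql_upsert)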

1 Comment

if_exists='replace' (there is no 'overwrite' option) would drop the table, so all data other than your df would be lost, while the 'append' used here only inserts and fails on duplicate keys. I think the question is how to keep the non-conflicting rows in the database table and update only the conflicting ones.
