
I am trying to insert data from an xlsx file into a MySQL table. If a row collides with an existing primary key I want to update the existing row, otherwise insert a new one. I have already written a script for this, but I realized it is a lot of work, and with pandas it is quick. How can I achieve this in pandas?

#!/usr/bin/env python3

import pandas as pd
import sqlalchemy

engine_str = 'mysql+pymysql://admin:mypass@localhost/mydb'
# note: the encoding argument was removed in SQLAlchemy 2.0
engine = sqlalchemy.create_engine(engine_str, echo=False, encoding='utf-8')

file_name = "tmp/results.xlsx"
df = pd.read_excel(file_name)

3 Answers


I can think of two options, but number 1 might be cleaner/faster:

1) Make SQL decide on the update/insert. Check this other question. You can iterate over the rows of your df, from i = 0 to n - 1, and inside the loop write something like:

query = """INSERT INTO table (id, name, age) VALUES(%s, %s, %s)
ON DUPLICATE KEY UPDATE name=%s, age=%s"""
engine.execute(query, (df.id[i], df.name[i], df.age[i], df.name[i], df.age[i]))
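
To make option 1 concrete, here is a minimal end-to-end sketch; the table name your_table and the columns id, name, age are assumptions carried over from the snippet above:

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine('mysql+pymysql://admin:mypass@localhost/mydb')
df = pd.read_excel("tmp/results.xlsx")

query = """INSERT INTO your_table (id, name, age) VALUES (%s, %s, %s)
           ON DUPLICATE KEY UPDATE name = %s, age = %s"""

# itertuples() is faster than iterrows() and gives attribute access per row
for row in df.itertuples(index=False):
    engine.execute(query, (row.id, row.name, row.age, row.name, row.age))

Because the values travel as bound parameters, the driver escapes them for you; no string building is needed.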

2) Define a Python function that returns True when the record exists and False otherwise, then use it in your loop:

def check_existence(user_id):
    # SELECT EXISTS(...) returns 1 if at least one matching row is found
    query = "SELECT EXISTS (SELECT 1 FROM your_table WHERE user_id_str = %s);"
    return list(engine.execute(query, (user_id,)))[0][0] == 1

You could iterate over the rows and run this check before each insert; a rough sketch follows.
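
A minimal sketch of that loop, assuming the same your_table keyed by user_id_str with name and age columns (all names illustrative):

for row in df.itertuples(index=False):
    if check_existence(row.user_id_str):
        # record exists: update it in place
        engine.execute(
            "UPDATE your_table SET name = %s, age = %s WHERE user_id_str = %s",
            (row.name, row.age, row.user_id_str))
    else:
        # record is new: insert it
        engine.execute(
            "INSERT INTO your_table (user_id_str, name, age) VALUES (%s, %s, %s)",
            (row.user_id_str, row.name, row.age))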

Please also check the solutions in this question and this one, which might work in your case.


3 Comments

The reason I am trying to avoid raw SQL is that the Excel files I receive often contain column values that have to be escaped before I can build a query and insert into the database. Since pandas handles column values very efficiently, I was hoping there was a way to use pandas to insert the data directly rather than executing an SQL query, which would let me bypass escaping random values (or garbage data).
@r0xette I don't get why you can't use an SQL query. Are you sure you can't change the column names or something? In any case, maybe using sqlalchemy will let you avoid explicit SQL string queries. AFAIK, pandas doesn't have a native way of choosing between inserting and updating rows.
I am using an SQL query now, escaping the double quotes etc. as they come. I was hoping there was a way to perform the upsert natively so the bad data would be escaped automatically.

Pangres is the tool for this job.

Overview here: https://pypi.org/project/pangres/

Use the function pangres.fix_psycopg2_bad_cols to "clean" the columns in the DataFrame.

Code/usage here: https://github.com/ThibTrip/pangres/wiki and https://github.com/ThibTrip/pangres/wiki/Fix-bad-column-names-postgres. Example code:

# From: https://github.com/ThibTrip/pangres/wiki/Fix-bad-column-names-postgres
import pandas as pd
from pangres import fix_psycopg2_bad_cols

# fix bad col/index names with the default replacements
# (empty string for '(', ')' and '%')
df = pd.DataFrame({'test()': [0],
                   'foo()%': [0]}).set_index('test()')
print(df)
#         foo()%
# test()
# 0            0

# clean cols/index with the default replacements
df_fixed = fix_psycopg2_bad_cols(df)
print(df_fixed)
#       foo
# test
# 0       0

# fix bad col/index names with custom replacements - you MUST provide
# replacements for '(', ')' and '%'

# reset df
df = pd.DataFrame({'test()': [0],
                   'foo()%': [0]}).set_index('test()')

# clean cols/index with user-specified replacements
df_fixed = fix_psycopg2_bad_cols(df, replacements={'%': 'percent', '(': '', ')': ''})
print(df_fixed)
#       foopercent
# test
# 0              0

It will only fix some of the bad characters: it replaces '%', '(' and ')' (characters that won't play nicely, or at all). But it is useful in that it handles both the cleanup and the upsert.
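
For the upsert itself, a minimal sketch, assuming a table named my_table keyed by the DataFrame's index (the table name and connection string are illustrative, and keyword names have shifted slightly across pangres versions; see the wiki for the current API):

import pandas as pd
from pangres import upsert
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://admin:mypass@localhost/mydb')
# pangres uses the DataFrame index as the primary key
df = pd.read_excel("tmp/results.xlsx").set_index('id')

# if_row_exists='update' overwrites colliding rows; 'ignore' skips them
upsert(con=engine, df=df, table_name='my_table', if_row_exists='update')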

(p.s., I know this post is over 4 years old, but still shows up in Google results when searching for "pangres upsert determine number inserts and updates" as the top SO result, dated May 13, 2020.)

2 Comments

For the code/usage section, could you provide examples from that link in the answer itself? Links are unreliable and may break, but if you provide the relevant pieces of the linked examples, this answer will stand the test of time a bit better.
@gallen - code added from source/link as requested

When using pandas, no explicit iteration is needed. Isn't that faster?

df = pd.read_csv(csv_file, sep=';', names=['column'])

df.to_sql('table', con=engine, if_exists='append', index=False, chunksize=20000)
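
Note that to_sql with if_exists='append' only inserts, so it fails on duplicate keys. You can keep the no-explicit-loop approach and still upsert by passing a callable as to_sql's method argument; a rough sketch for MySQL (the mysql_upsert name and the choice to update every column are my assumptions):

from sqlalchemy.dialects.mysql import insert

def mysql_upsert(table, conn, keys, data_iter):
    # pandas calls this once per chunk with the target SQLTable, an open
    # connection, the column names, and an iterator of row tuples
    data = [dict(zip(keys, row)) for row in data_iter]
    stmt = insert(table.table).values(data)
    # on a key collision, update every column with the value that would
    # have been inserted
    stmt = stmt.on_duplicate_key_update(
        {c.name: stmt.inserted[c.name] for c in table.table.columns})
    conn.execute(stmt)

df.to_sql('table', con=engine, if_exists='append', index=False,
          chunksize=20000, method=mysql_upsert)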

1 Comment

if_exists='replace' (there is no 'overwrite' option) would drop the table, so all data other than your df would be lost, while the 'append' used here only inserts and fails on duplicate keys. I think the question is how to keep the non-conflicting rows in the database table and update only the conflicting ones.
