
Python Version - 2.7.6

Pandas Version - 0.17.1

MySQLdb Version - 1.2.5

In my database (PRODUCT), I have a table (XML_FEED). The XML_FEED table is huge (millions of records). I also have a pandas.DataFrame() (PROCESSED_DF) with thousands of rows.

Now I need to run this

REPLACE INTO PRODUCT.XML_FEED
(COL1, COL2, COL3, COL4, COL5)
VALUES (PROCESSED_DF.values)

Question:-

Is there a way to run REPLACE INTO in pandas? I already checked pandas.DataFrame.to_sql(), but that is not what I need. I would prefer not to read the XML_FEED table into pandas, because it is very large.

4 Answers


With the release of pandas 0.24.0, there is now an official way to achieve this by passing a custom insert method to the to_sql function.

I was able to achieve the behavior of REPLACE INTO by passing this callable to to_sql:

def mysql_replace_into(table, conn, keys, data_iter):
    from sqlalchemy.ext.compiler import compiles
    from sqlalchemy.sql.expression import Insert

    @compiles(Insert)
    def replace_string(insert, compiler, **kw):
        # Compile the INSERT normally, then swap the leading keyword
        s = compiler.visit_insert(insert, **kw)
        s = s.replace("INSERT INTO", "REPLACE INTO")
        return s

    data = [dict(zip(keys, row)) for row in data_iter]

    conn.execute(table.table.insert(), data)

You would pass it like so:

df.to_sql('XML_FEED', con=engine, if_exists='append', method=mysql_replace_into)

Alternatively, if you want the behavior of INSERT ... ON DUPLICATE KEY UPDATE ... instead, you can use this:

def mysql_replace_into(table, conn, keys, data_iter):
    from sqlalchemy.dialects.mysql import insert

    data = [dict(zip(keys, row)) for row in data_iter]

    stmt = insert(table.table).values(data)
    # On a duplicate key, update every column with the value it would have been inserted with
    update_stmt = stmt.on_duplicate_key_update(
        **dict(zip(stmt.inserted.keys(), stmt.inserted.values()))
    )

    conn.execute(update_stmt)
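The practical difference between the two callables: REPLACE INTO deletes the conflicting row and inserts a fresh one (columns you do not supply are reset), while ON DUPLICATE KEY UPDATE modifies the existing row in place. SQLite's ON CONFLICT ... DO UPDATE behaves like the latter, so the upsert semantics can be sketched with just the standard library (a minimal sketch with a hypothetical feed table; no MySQL server needed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE feed (feed_id INTEGER PRIMARY KEY, price REAL, note TEXT)")
conn.execute("INSERT INTO feed VALUES (1, 9.99, 'original')")

# Upsert: the existing row is updated in place; unspecified columns survive
conn.execute(
    "INSERT INTO feed (feed_id, price) VALUES (1, 19.99) "
    "ON CONFLICT(feed_id) DO UPDATE SET price = excluded.price"
)
print(conn.execute("SELECT feed_id, price, note FROM feed").fetchone())
# -> (1, 19.99, 'original')
```

Note how the 'note' column keeps its old value; with REPLACE INTO it would have been reset to NULL.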

Credits to https://stackoverflow.com/a/11762400/1919794 for the compile method.


2 Comments

Thank you, this works exactly how I want. I just wish pandas could add an option for this as well...
Thanks devull. This solution used to work for me just fine, except when I updated my system and installed the latest Python, Pandas, SQLAlchemy, etc. Now, I get the following error: TypeError: TableClause.insert() got an unexpected keyword argument 'replace_string'

As of this version (0.17.1), I am unable to find any direct way to do this in pandas. I reported a feature request for it. In my project I did this by executing some queries using MySQLdb and then using DataFrame.to_sql(if_exists='append').

Suppose

1) product_id is my primary key in table PRODUCT

2) feed_id is my primary key in table XML_FEED.

SIMPLE VERSION

import MySQLdb
import sqlalchemy
import pandas

con = MySQLdb.connect('localhost', 'root', 'my_password', 'database_name')
con_str = 'mysql+mysqldb://root:my_password@localhost/database_name'
engine = sqlalchemy.create_engine(con_str)  # because I am using mysql
df = pandas.read_sql('SELECT * from PRODUCT', con=engine)
df_product_id = df['product_id']
product_id_str = (str(list(df_product_id.values))).strip('[]')
delete_str = 'DELETE FROM XML_FEED WHERE feed_id IN ({0})'.format(product_id_str)
cur = con.cursor()
cur.execute(delete_str)
con.commit()
df.to_sql('XML_FEED', if_exists='append', con=engine)
# flavor='mysql' would avoid creating a sqlalchemy engine, but it is deprecated
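The string juggling above builds the IN (...) list for the DELETE statement; a quick illustration of what product_id_str and delete_str end up looking like (hypothetical id values, and note this only works cleanly for numeric ids, since string ids would need quoting or a parameterized query):

```python
product_id_values = [101, 102, 103]  # stands in for df['product_id'].values
product_id_str = str(list(product_id_values)).strip('[]')
delete_str = 'DELETE FROM XML_FEED WHERE feed_id IN ({0})'.format(product_id_str)
print(delete_str)
# -> DELETE FROM XML_FEED WHERE feed_id IN (101, 102, 103)
```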

Please note: the REPLACE [INTO] syntax allows us to INSERT a row into a table, except that if a UNIQUE KEY (including PRIMARY KEY) violation occurs, the old row is deleted prior to the new INSERT, hence no violation.
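SQLite also supports REPLACE INTO with the same delete-then-insert semantics, so the behavior described above can be checked without a MySQL server (a minimal sketch with a hypothetical feed table, using only the standard library):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE feed (feed_id INTEGER PRIMARY KEY, price REAL, note TEXT)")
conn.execute("INSERT INTO feed VALUES (1, 9.99, 'original')")

# REPLACE INTO deletes the conflicting row first, then inserts the new one,
# so columns not supplied are reset to their defaults (here, NULL)
conn.execute("REPLACE INTO feed (feed_id, price) VALUES (1, 19.99)")
print(conn.execute("SELECT feed_id, price, note FROM feed").fetchone())
# -> (1, 19.99, None)
```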



I needed a generic solution to this problem, so I built on shiva's answer; maybe it will be helpful to others. This is useful in situations where you grab a table from a MySQL database (whole or filtered), update or add some rows, and want to perform a REPLACE INTO statement with df.to_sql().

It finds the table's primary keys, performs a delete statement on the MySQL table with all keys from the pandas dataframe, and then inserts the dataframe into the MySQL table.

def to_sql_update(df, engine, schema, table):
    df.reset_index(inplace=True)  # promote index columns so they are written too
    sql = ''' SELECT column_name from information_schema.columns
              WHERE table_schema = '{schema}' AND table_name = '{table}' AND
                    COLUMN_KEY = 'PRI';
          '''.format(schema=schema, table=table)
    id_cols = [x[0] for x in engine.execute(sql).fetchall()]
    id_vals = [df[col_name].tolist() for col_name in id_cols]
    # Delete every row whose primary key appears in the dataframe...
    sql = ''' DELETE FROM {schema}.{table} WHERE 0 '''.format(schema=schema, table=table)
    for row in zip(*id_vals):
        sql_row = ' AND '.join([''' {}='{}' '''.format(n, v) for n, v in zip(id_cols, row)])
        sql += ' OR ({}) '.format(sql_row)
    engine.execute(sql)

    # ...then append the dataframe, which together act like REPLACE INTO
    df.to_sql(table, engine, schema=schema, if_exists='append', index=False)
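To see what the generated DELETE looks like, the loop body can be exercised on its own (hypothetical key columns and values; no database needed). Note that values are interpolated directly into the SQL string, so this is only safe for trusted, simple values; a parameterized query would be more robust:

```python
id_cols = ['feed_id', 'region']   # hypothetical composite primary key
id_vals = [[1, 2], ['us', 'eu']]  # one list of values per key column

# WHERE 0 is a false base condition, so each key tuple is ORed onto it
sql = ''' DELETE FROM {schema}.{table} WHERE 0 '''.format(schema='mydb', table='XML_FEED')
for row in zip(*id_vals):
    sql_row = ' AND '.join([''' {}='{}' '''.format(n, v) for n, v in zip(id_cols, row)])
    sql += ' OR ({}) '.format(sql_row)
print(sql)
```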

4 Comments

This works great, thank you. However I removed line 2 because I don't think it's required, and with it you are left with an extra column 'index' which will of course cause an error - unless you meant to add df.drop(['index'], axis=1, inplace=True).
That's a good point; the second line is only needed if the df has an index set on one or more columns.
I am not able to understand the name variable; can you help me? df.to_sql(name, engine, schema=schema, if_exists='append', index=False)
That was a typo, it should be df.to_sql(table, engine ...). I fixed it in the answer.

If you use to_sql you should be able to define it so that you replace values if they exist, so for a table named 'mydb' and a dataframe named 'df', you'd use:

df.to_sql('mydb', con=engine, if_exists='replace')

That should replace values if they already exist, but I am not 100% sure if that's what you're looking for.

2 Comments

if_exists works for the table, not for rows in the table. if_exists : {'fail', 'replace', 'append'}, default 'fail'. fail: if the table exists, do nothing. replace: if the table exists, drop it, recreate it, and insert the data. append: if the table exists, insert the data; create it if it does not exist.
His answer is still valid as long as it is mentioned that it replaces the table with the whole dataframe. If his command is preceded by a filter on the df, then it is an appropriate one.
