I have a DataFrame like this, but with millions of rows and about 15 columns:
   id       name     col1   col2    total
0  8252552  CHARLIE  DESC1  VALUE1   5.99
1  8252552  CHARLIE  DESC1  VALUE2  20.00
2  5699881  JOHN     DESC1  VALUE1  39.00
3  5699881  JOHN     DESC2  VALUE3  -3.99
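For reproducibility, the sample frame above can be rebuilt like this:

import pandas as pd

df = pd.DataFrame({
    "id": [8252552, 8252552, 5699881, 5699881],
    "name": ["CHARLIE", "CHARLIE", "JOHN", "JOHN"],
    "col1": ["DESC1", "DESC1", "DESC1", "DESC2"],
    "col2": ["VALUE1", "VALUE2", "VALUE1", "VALUE3"],
    "total": [5.99, 20.00, 39.00, -3.99],
})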
The DataFrame needs to be exported to a SQL database, split across several tables. I'm currently using SQLite3 to test the functionality. The tables would be:
- main (id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE, people_id INTEGER, col1_id INTEGER, col2_id INTEGER, total REAL)
- people (id INTEGER NOT NULL PRIMARY KEY UNIQUE, name TEXT UNIQUE)
- col1 (id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE, name TEXT UNIQUE)
- col2 (id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE, name TEXT UNIQUE)
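For reference, this is roughly the script I use to create those tables in SQLite (the REFERENCES clauses are my assumption about how the tables should be linked):

import sqlite3

conn = sqlite3.connect("main.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS people (
    id INTEGER NOT NULL PRIMARY KEY UNIQUE,
    name TEXT UNIQUE
);
CREATE TABLE IF NOT EXISTS col1 (
    id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    name TEXT UNIQUE
);
CREATE TABLE IF NOT EXISTS col2 (
    id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    name TEXT UNIQUE
);
CREATE TABLE IF NOT EXISTS main (
    id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    people_id INTEGER REFERENCES people(id),
    col1_id INTEGER REFERENCES col1(id),
    col2_id INTEGER REFERENCES col2(id),
    total REAL
);
""")
conn.commit()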
The main table should look similar to this:
   people_id  col1_id  col2_id  total
0    8252552        1        1   5.99
1    8252552        1        2  20.00
2    5699881        1        1  39.00
3    5699881        2        3  -3.99
Other tables, like "people", would look like this:
id       name
8252552  CHARLIE
5699881  JOHN
The thing is, I can't find a way to achieve this using the schema argument of pandas' to_sql method. Using plain Python, I'd do something like this:
import sqlite3

conn = sqlite3.connect("main.db")
cur = conn.cursor()
for _, row in dataframe.iterrows():
    person_id = row["id"]
    name = row["name"]
    col1 = row["col1"]
    col2 = row["col2"]
    total = row["total"]
    # Insert the person if not seen before; people uses the natural id, so no lookup needed.
    cur.execute("INSERT OR IGNORE INTO people (id, name) VALUES (?, ?)", (person_id, name))
    people_id = person_id
    # Insert the col1 value if new, then look up its autoincremented id.
    cur.execute("INSERT OR IGNORE INTO col1 (name) VALUES (?)", (col1,))
    cur.execute("SELECT id FROM col1 WHERE name = ?", (col1,))
    col1_id = cur.fetchone()[0]
    # Same for col2.
    cur.execute("INSERT OR IGNORE INTO col2 (name) VALUES (?)", (col2,))
    cur.execute("SELECT id FROM col2 WHERE name = ?", (col2,))
    col2_id = cur.fetchone()[0]
    cur.execute("INSERT OR REPLACE INTO main (people_id, col1_id, col2_id, total) VALUES (?, ?, ?, ?)",
                (people_id, col1_id, col2_id, total))
conn.commit()
That would automatically add the corresponding values to the lookup tables (people, col1 and col2), build a row with the desired values and foreign keys, and insert that row into the main table. However, with this many columns and rows, a row-by-row loop might get very slow. I'm also not confident this is a "best practice" when dealing with databases (I'm fairly new to database development).
My question is: is there a way to export a pandas DataFrame to multiple SQL tables, applying normalization rules like those in the example above? And is there a way to get the same result with better performance?
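For context, the kind of vectorized approach I'm imagining would look roughly like the sketch below, using the df and conn from above. It builds the lookup tables with drop_duplicates, maps the text values to ids, and bulk-loads everything with to_sql. I haven't verified that this is the right way to do it, and it only works as a one-shot load into empty tables (incremental updates would still need the INSERT OR IGNORE logic):

# Lookup tables, built once with drop_duplicates instead of per-row INSERTs.
people = df[["id", "name"]].drop_duplicates()

col1 = df[["col1"]].drop_duplicates().reset_index(drop=True).rename(columns={"col1": "name"})
col1["id"] = col1.index + 1  # mimic the AUTOINCREMENT ids, starting at 1

col2 = df[["col2"]].drop_duplicates().reset_index(drop=True).rename(columns={"col2": "name"})
col2["id"] = col2.index + 1

# Main table: replace each text value with its lookup id via a vectorized map.
main = pd.DataFrame({
    "people_id": df["id"],
    "col1_id": df["col1"].map(col1.set_index("name")["id"]),
    "col2_id": df["col2"].map(col2.set_index("name")["id"]),
    "total": df["total"],
})

# Bulk-load each table in a single call instead of millions of single-row INSERTs.
people.to_sql("people", conn, if_exists="append", index=False)
col1.to_sql("col1", conn, if_exists="append", index=False)
col2.to_sql("col2", conn, if_exists="append", index=False)
main.to_sql("main", conn, if_exists="append", index=False)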