I have a DataFrame like this, but with millions of rows and about 15 columns:
   id       name     col1   col2    total
0  8252552  CHARLIE  DESC1  VALUE1   5.99
1  8252552  CHARLIE  DESC1  VALUE2  20.00
2  5699881  JOHN     DESC1  VALUE1  39.00
3  5699881  JOHN     DESC2  VALUE3  -3.99
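For reproducibility, the sample frame above can be rebuilt like this:

import pandas as pd

df = pd.DataFrame({
    "id": [8252552, 8252552, 5699881, 5699881],
    "name": ["CHARLIE", "CHARLIE", "JOHN", "JOHN"],
    "col1": ["DESC1", "DESC1", "DESC1", "DESC2"],
    "col2": ["VALUE1", "VALUE2", "VALUE1", "VALUE3"],
    "total": [5.99, 20.00, 39.00, -3.99],
})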
The DataFrame needs to be exported to a SQL database, split across several tables. I'm currently using SQLite3 to test the functionality. The tables would be:
- main (id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE, people_id INTEGER, col1_id INTEGER, col2_id INTEGER, total REAL)
- people (id INTEGER NOT NULL PRIMARY KEY UNIQUE, name TEXT UNIQUE)
- col1 (id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE, name TEXT UNIQUE)
- col2 (id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE, name TEXT UNIQUE)
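For reference, this is roughly the script I use to create those tables in SQLite (the REFERENCES clauses are my assumption about how the tables should be linked):

import sqlite3

conn = sqlite3.connect("main.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS people (
    id INTEGER NOT NULL PRIMARY KEY UNIQUE,
    name TEXT UNIQUE
);
CREATE TABLE IF NOT EXISTS col1 (
    id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    name TEXT UNIQUE
);
CREATE TABLE IF NOT EXISTS col2 (
    id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    name TEXT UNIQUE
);
CREATE TABLE IF NOT EXISTS main (
    id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    people_id INTEGER REFERENCES people(id),
    col1_id INTEGER REFERENCES col1(id),
    col2_id INTEGER REFERENCES col2(id),
    total REAL
);
""")
conn.commit()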
The main table should look similar to this:
   people_id  col1_id  col2_id  total
0    8252552        1        1   5.99
1    8252552        1        2  20.00
2    5699881        1        1  39.00
3    5699881        2        3  -3.99
Other tables, like "people", would look like this:
id       name
8252552  CHARLIE
5699881  JOHN
The thing is, I can't find a way to achieve this using the schema argument of pandas' to_sql method. Using plain Python, I'd do something like this:
import sqlite3

conn = sqlite3.connect("main.db")
cur = conn.cursor()
for _, row in dataframe.iterrows():
    person_id = row["id"]
    name = row["name"]
    col1 = row["col1"]
    col2 = row["col2"]
    total = row["total"]
    # Insert the person if not seen before; people uses the natural id, so no lookup needed.
    cur.execute("INSERT OR IGNORE INTO people (id, name) VALUES (?, ?)", (person_id, name))
    people_id = person_id
    # Insert the col1 value if new, then look up its autoincremented id.
    cur.execute("INSERT OR IGNORE INTO col1 (name) VALUES (?)", (col1,))
    cur.execute("SELECT id FROM col1 WHERE name = ?", (col1,))
    col1_id = cur.fetchone()[0]
    # Same for col2.
    cur.execute("INSERT OR IGNORE INTO col2 (name) VALUES (?)", (col2,))
    cur.execute("SELECT id FROM col2 WHERE name = ?", (col2,))
    col2_id = cur.fetchone()[0]
    cur.execute("INSERT OR REPLACE INTO main (people_id, col1_id, col2_id, total) VALUES (?, ?, ?, ?)",
                (people_id, col1_id, col2_id, total))
conn.commit()
That would automatically add the corresponding values to the lookup tables (people, col1 and col2), build a row with the desired values and foreign keys, and insert that row into the main table. However, with this many columns and rows, a row-by-row loop might get very slow. I'm also not confident this is a "best practice" when dealing with databases (I'm fairly new to database development).
My question is: is there a way to export a pandas DataFrame to multiple SQL tables, applying normalization rules like those in the example above? And is there a way to get the same result with better performance?
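For context, the kind of vectorized approach I'm imagining would look roughly like the sketch below, using the df and conn from above. It builds the lookup tables with drop_duplicates, maps the text values to ids, and bulk-loads everything with to_sql. I haven't verified that this is the right way to do it, and it only works as a one-shot load into empty tables (incremental updates would still need the INSERT OR IGNORE logic):

# Lookup tables, built once with drop_duplicates instead of per-row INSERTs.
people = df[["id", "name"]].drop_duplicates()

col1 = df[["col1"]].drop_duplicates().reset_index(drop=True).rename(columns={"col1": "name"})
col1["id"] = col1.index + 1  # mimic the AUTOINCREMENT ids, starting at 1

col2 = df[["col2"]].drop_duplicates().reset_index(drop=True).rename(columns={"col2": "name"})
col2["id"] = col2.index + 1

# Main table: replace each text value with its lookup id via a vectorized map.
main = pd.DataFrame({
    "people_id": df["id"],
    "col1_id": df["col1"].map(col1.set_index("name")["id"]),
    "col2_id": df["col2"].map(col2.set_index("name")["id"]),
    "total": df["total"],
})

# Bulk-load each table in a single call instead of millions of single-row INSERTs.
people.to_sql("people", conn, if_exists="append", index=False)
col1.to_sql("col1", conn, if_exists="append", index=False)
col2.to_sql("col2", conn, if_exists="append", index=False)
main.to_sql("main", conn, if_exists="append", index=False)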