
I have a DataFrame like this, but with millions of rows and about 15 columns:

       id    name  col1   col2  total
0 8252552 CHARLIE DESC1 VALUE1   5.99
1 8252552 CHARLIE DESC1 VALUE2  20.00
2 5699881    JOHN DESC1 VALUE1  39.00
3 5699881    JOHN DESC2 VALUE3  -3.99

The DataFrame needs to be exported to a SQL database, split across several tables. I'm currently using SQLite3 to test the functionality. The tables would be:

  • main (id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE, people_id INTEGER, col1_id INTEGER, col2_id INTEGER, total REAL)
  • people (id INTEGER NOT NULL PRIMARY KEY UNIQUE, name TEXT UNIQUE)
  • col1 (id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE, name TEXT UNIQUE)
  • col2 (id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE, name TEXT UNIQUE)
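
For testing, I create the tables with something along these lines (a sketch of the DDL for the schema above):

import sqlite3

conn = sqlite3.connect("main.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS people (
    id   INTEGER NOT NULL PRIMARY KEY UNIQUE,
    name TEXT UNIQUE
);
CREATE TABLE IF NOT EXISTS col1 (
    id   INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    name TEXT UNIQUE
);
CREATE TABLE IF NOT EXISTS col2 (
    id   INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    name TEXT UNIQUE
);
CREATE TABLE IF NOT EXISTS main (
    id        INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    people_id INTEGER,
    col1_id   INTEGER,
    col2_id   INTEGER,
    total     REAL
);
""")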

The main table should look similar to this:

  people_id col1_id col2_id  total
0   8252552       1       1   5.99
1   8252552       1       2  20.00
2   5699881       1       1  39.00
3   5699881       2       3  -3.99

The other tables, such as "people", would look like this:

     id    name
8252552 CHARLIE
5699881    JOHN

The thing is, I can't find a way to achieve that using the schema argument of the to_sql method in pandas. Using plain Python and sqlite3, I'd do something like this:

import sqlite3

conn = sqlite3.connect("main.db")
cur = conn.cursor()
for _, row in dataframe.iterrows():
    people_id = row["id"]
    name = row["name"]
    col1 = row["col1"]
    col2 = row["col2"]
    total = row["total"]
    # people ids come from the data itself, so the row can be inserted directly
    cur.execute("INSERT OR IGNORE INTO people (id, name) VALUES (?, ?)", (people_id, name))
    # col1/col2 ids are auto-generated, so look them up after the insert
    cur.execute("INSERT OR IGNORE INTO col1 (name) VALUES (?)", (col1,))
    cur.execute("SELECT id FROM col1 WHERE name = ?", (col1,))
    col1_id = cur.fetchone()[0]
    cur.execute("INSERT OR IGNORE INTO col2 (name) VALUES (?)", (col2,))
    cur.execute("SELECT id FROM col2 WHERE name = ?", (col2,))
    col2_id = cur.fetchone()[0]
    cur.execute("INSERT INTO main (people_id, col1_id, col2_id, total) VALUES (?, ?, ?, ?)",
                (people_id, col1_id, col2_id, total))
conn.commit()

That would automatically add the corresponding values to the lookup tables (people, col1 and col2), build a row with the desired values and foreign keys, and add that row to the main table. However, there are a lot of columns and rows, so this might get very slow. Plus, I don't feel very confident that this is a "best practice" when dealing with databases (I'm fairly new to database development).

My question is: is there a way to export a pandas DataFrame to multiple SQL tables, applying normalization rules as in the example above? And is there a way to get the same result with better performance?

1 Answer


Could you first split your Pandas data frame into several sub data frames according to the database tables, then apply the to_sql() method to each of them?
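
A rough sketch of that idea, assuming df is the original DataFrame from the question, the tables already exist, and lookup ids are numbered from 1 in order of first appearance:

import sqlite3
import pandas as pd

conn = sqlite3.connect("main.db")

# people: the ids already exist in the data, so the unique (id, name) pairs are enough
people = df[["id", "name"]].drop_duplicates()
people.to_sql("people", conn, if_exists="append", index=False)

# col1 / col2: build small lookup frames from the unique values, numbering ids from 1
col1_names = df["col1"].unique()
col1 = pd.DataFrame({"id": range(1, len(col1_names) + 1), "name": col1_names})
col1.to_sql("col1", conn, if_exists="append", index=False)

col2_names = df["col2"].unique()
col2 = pd.DataFrame({"id": range(1, len(col2_names) + 1), "name": col2_names})
col2.to_sql("col2", conn, if_exists="append", index=False)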


1 Comment

Yes, that would be one option. But how would I change the values and normalize it?
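
One way to handle that step in pandas, continuing the sketch above (same df, conn, and col1/col2 lookup frames): replace the text columns with their ids via merges, then write the result with to_sql:

# replace the text values with the ids from the lookup frames, then write main
main = (
    df.rename(columns={"id": "people_id"})
      .merge(col1.rename(columns={"id": "col1_id", "name": "col1"}), on="col1")
      .merge(col2.rename(columns={"id": "col2_id", "name": "col2"}), on="col2")
      [["people_id", "col1_id", "col2_id", "total"]]
)
main.to_sql("main", conn, if_exists="append", index=False)

Everything stays vectorised, so it should scale considerably better than the row-by-row loop in the question.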
