
In my database DDL (SQL Server) I have fields which have default values:

my_database_field varchar(255)  NOT NULL 
     DEFAULT  'NVT'

However, when I try to insert data into this table using DataFrame.to_sql() I get the following error:

Cannot insert the value NULL into column 'my_database_field', table 'MYDB.dbo.my_table'; column does not allow nulls. INSERT fails. 

Here is the code:

with engine.begin() as conn:
    dataframe.to_sql(table_name, conn, index=False, if_exists='append')

The column 'my_database_field' has NaN values:

data['my_database_field']
0      NaN
1      NaN
2      NaN
3      NaN
4      NaN
        ..
4154   NaN
4155   NaN
4156   NaN
4157   NaN
4158   NaN

This seems to be because the to_sql method replaces NaN values with NULLs (which is probably how it should be).

What I want is a way, when using to_sql, to have NaN values result in the database applying the DEFAULT value defined in the schema, instead of NULL being inserted directly.

I've tried replacing all NaNs in the dataframe with empty strings, but that suffers from a similar problem: the empty strings are propagated to the database and that is what gets stored, not the default value specified in the DDL.

The only way I have been able to get this to work is to not use the pandas DataFrame.to_sql method at all, and instead load the data from a CSV file and write it into the database line by line using basic SQLAlchemy SQL Expressions.
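For reference, here is roughly what that fallback looks like. This is only a sketch (the connection string and file name are placeholders); the key point is that each row's NaN entries are dropped from its dict, so those columns are omitted from the INSERT and the DEFAULT can apply:

import math

import pandas as pd
from sqlalchemy import MetaData, Table, create_engine

engine = create_engine("mssql+pyodbc://...")   # placeholder connection string
metadata = MetaData()
# reflect the existing table (SQLAlchemy 1.4 style reflection)
my_table = Table('my_table', metadata, autoload_with=engine)

df = pd.read_csv('my_data.csv')   # placeholder file name

with engine.begin() as conn:
    for record in df.to_dict(orient='records'):
        # drop NaN entries so those columns are omitted from the INSERT
        # and the database applies the DEFAULT from the DDL
        values = {k: v for k, v in record.items()
                  if not (isinstance(v, float) and math.isnan(v))}
        conn.execute(my_table.insert().values(**values))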

I've opened an issue in the pandas github: https://github.com/pandas-dev/pandas/issues/42408

Edit: As per @Larnu's comment, the key bit of information I was missing was that:

A DEFAULT value is only used when you don't specify the column in the INSERT. If you (or the application) explicitly supplies a NULL value, then a NULL value will be inserted. DEFAULT isn't a "use this value when NULL is supplied" property; it's a "use this value when the column is omitted" property.
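In other words, against the table from the error message above (borrowing the my_id column from the model in Edit 2 below, and reusing the engine from the question), the difference looks roughly like this hypothetical sketch:

from sqlalchemy import text

with engine.begin() as conn:
    # column omitted from the INSERT -> the DEFAULT 'NVT' is applied
    conn.execute(text("INSERT INTO my_table (my_id) VALUES ('a')"))
    # column explicitly supplied as NULL -> NULL is inserted, which fails
    # here because the column is declared NOT NULL
    conn.execute(text(
        "INSERT INTO my_table (my_id, my_database_field) VALUES ('b', NULL)"))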

The problem now becomes how to elegantly handle inserting a large dataframe where some columns contain some NaNs. I'd like to avoid having to go row by row to insert the data with a check to see if any values in that row are NaN.

Edit 2: I solved this problem by using the ORM as follows:

import logging

import numpy as np
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class MyModel(Base):
    __tablename__ = 'my_table'

    my_id = Column(String, primary_key=True)
    my_integer_field = Column(Integer)
    my_database_field = Column(String, server_default='NVT')

    @staticmethod
    def from_dict(input_dict: dict):
        # Replace NaN values with None so that SQLAlchemy omits those
        # columns from the INSERT and the server-side default applies.
        for k, v in input_dict.items():
            logging.debug(f"Key: {k} - Value: {v}")
            if isinstance(v, str):
                continue
            try:
                if np.isnan(v):
                    input_dict[k] = None
            except TypeError as e:
                logging.debug(e)
                logging.debug(f"Could not test if {type(v)}: {v} was NaN")

        return __class__(**input_dict)

As you can see, I have added a custom static method called 'from_dict' which builds the model from a dictionary. For each value I check whether it is np.nan and, if so, set it to None instead. This seems to make the ORM omit that field from the INSERT it generates under the hood, and therefore lets the database use the default value. np.isnan() only works for numbers, not strings, so for those I just catch the TypeError, log some info and move on.

For string values it seems I must set the 'server_default' in the model itself. If anyone knows more about what's going on here please feel free to share.
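Loading the dataframe with this model then looks roughly like the following; the session handling here is just a sketch (SQLAlchemy 1.4 style), not the exact code I use:

from sqlalchemy.orm import Session

with Session(engine) as session:
    records = dataframe.to_dict(orient='records')
    session.add_all([MyModel.from_dict(record) for record in records])
    session.commit()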

  • A DEFAULT value is only used when you don't specify the column in the INSERT. If you (or the application) explicitly supplies a NULL value, then a NULL value will be inserted. DEFAULT isn't a "use this value when NULL is supplied" property; it's a "use this value when the column is omitted" property. Commented Jul 8, 2021 at 11:44
  • @Larnu That's a good point. Something I didn't know. Any suggestions on how to handle this then? I need to batch insert this data, some of which have values, the rest are NaN. I'd like to avoid going line by line and checking if a given row has NaN values and removing the column header if it does. Commented Jul 8, 2021 at 12:48
  • Sorry if I missed this, but is "my_database_field" the only column you are concerned with using some default value for? Commented Jul 8, 2021 at 15:56
  • Any issue letting pandas put in the 'default' value? .fillna('NVT') Commented Jul 9, 2021 at 2:37
  • Seems like you need to process the data frame on the Python side first, i.e.: to replace the NaN values with the column default specified in the database. Commented Jul 9, 2021 at 2:37

1 Answer


As @Larnu mentions in his comment, the default value is only used when the column is omitted from the INSERT statement. Pandas sees the column name in the dataframe and includes it in the insert, so you need to split up your data and drop the columns where you want the defaults to apply.

With small data, you can iterate through the rows, call dropna() on each one to drop its NaN columns, and then call to_sql().
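For example, a rough sketch using the dataframe, table_name and conn from the question (simple, but slow for big data):

for _, row in dataframe.iterrows():
    # drop this row's NaN entries so those columns are omitted from the
    # INSERT and the database defaults apply
    row.dropna().to_frame().T.to_sql(
        table_name, conn, index=False, if_exists='append')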

For larger datasets you could split up your data: first upload the rows that have no NaNs at all. Then, for every combination of columns, take the rows where all of those columns are null and every other column has a value, drop the null columns, and upload only the columns that do have data. For example, with a dataframe of columns A and B, you would upload all rows where both values are non-null, then a dataframe of column A only for rows where B is null, and vice versa.

import numpy as np
import pandas as pd
import itertools

#create random dataframe of integers
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), 
    columns=['A','B','C','D'])
#turn ~20% of them into NaN's
df = df.mask(np.random.random(df.shape) < .2)

#upload the no nulls data first
df.dropna().to_sql(table_name, conn, index=False, if_exists='append')

#get column names into list
cols = list(df.columns)
#create powerset of column name combinations (the full set is excluded:
#  rows where every column is NaN have nothing to insert)
p_set = itertools.chain.from_iterable(
    itertools.combinations(cols, r) for r in range(1, len(cols)))

#iterate through all combinations of column names
for null_cols in p_set:
    #determine what the columns are where we DO want values
    not_null_cols = list(set(cols) - set(null_cols))

    #filter original dataframe to rows where ALL the null_cols are 
    # null AND ALL the other cols have a non-null value
    sub_df = df[df[list(null_cols)].isnull().all(1) &
                df[list(not_null_cols)].notnull().all(1)]

    #if this dataframe has values, send it to database
    if len(sub_df) > 0:
        #sub_df still has all columns at this point, only keep the
        #  ones with values
        sub_df = sub_df[not_null_cols]
        sub_df.to_sql(table_name, conn, index=False, if_exists='append')

1 Comment

I considered this idea but only got as far as removing all the rows with NaNs. I found that out of a dataset of 4000 rows, only about 150 had no NaNs in any column, and I couldn't work out how to break up the data beyond that, so hats off to you for figuring it out. In the end I went with using the ORM; see the edited post. Thanks for your answer!
