
I am indexing data from a pandas DataFrame into Elasticsearch. I have null_value set for some ES fields, but not others. How do I drop the columns without null_value, but keep the ones that have it (setting their value to None)?

es mapping:

    "properties": {
        "sa_start_date": {"type": "date", "null_value": "1970-01-01T00:00:00+00:00"},
        "location_name": {"type": "text"},

code:

import numpy as np
import pandas as pd

cols_with_null_value = ['sa_start_date']
orig = [{
    'meter_id': 'M1',
    'sa_start_date': '',
    'location_name': ''
}, {
    'meter_id': 'M1',
    'sa_start_date': '',
    'location_name': 'a'
}]
df = pd.DataFrame.from_dict(orig)

# '' coerces to NaT in the date column; remaining '' become NaN
df['sa_start_date'] = pd.to_datetime(df['sa_start_date'], utc=True, errors='coerce')
df.replace({'': np.nan}, inplace=True)
df:
   meter_id sa_start_date location_name
0       M1           NaT           NaN
1       M1           NaT             a

dicts needed for elasticsearch index:

{"meter_id": M1, "sa_start_date": None}
{"meter_id": M1, "sa_start_date": None, "location_name": "a"}

Note that location_name cells with NaN are not indexed, but sa_start_date cells with NaT are. I've tried many things, each more ridiculous than the last, and have nothing worth showing. Any ideas appreciated!

Tried this, but the Nones are dropped along with the NaNs (dropna() treats None as missing, just like NaN/NaT):

df[cols_with_null_value] = df[cols_with_null_value].replace({np.nan: None})

df:
   meter_id sa_start_date location_name
0       M1          None           NaN
1       M1          None             a

for row in df.iterrows():
    ser = row[1]
    ser.dropna(inplace=True)

    lc = dict(ser)

lc: {'meter_id': 'M1'}
lc: {'meter_id': 'M1', 'location_name': 'a'}

1 Answer

Don't use .dropna() here. It will drop either entire rows or entire columns, and you want to keep everything except the empty location names.

You can do this in the following way:

df.replace({'': None}, inplace=True)  # replace '' with None instead of np.nan

for idx, row in df.iterrows():
    lc = {k: v for k, v in row.items() if not (k == 'location_name' and v is None)}
    print(lc)

Result:

{'meter_id': 'M1', 'sa_start_date': None}
{'meter_id': 'M1', 'sa_start_date': None, 'location_name': 'a'}
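
If more than one column lacks null_value, the same dict comprehension can be driven by the question's cols_with_null_value list instead of a hard-coded column name. A minimal sketch (pd.isna() catches None, NaN, and NaT alike, so it also works on a DataFrame whose dates were already coerced):

for idx, row in df.iterrows():
    # keep a field if it has a value, or if its mapping defines null_value;
    # missing values in the kept null_value columns are sent as None
    lc = {k: (None if pd.isna(v) else v)
          for k, v in row.items()
          if k in cols_with_null_value or not pd.isna(v)}
    print(lc)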

3 Comments

Replacing NaN with None has the side effect that all data types of the DataFrame become 'object'. Any solutions for this?
@Chiel: what downstream issue(s) does that cause? I'm asking because any possible solution might depend on what else you want to do with the DataFrame.
I have a DataFrame with various types: int64, float64, and datetime64. Whenever I use df = df.replace({np.nan: None}), it works in the sense that it properly replaces NaN, NaT, and NA with None. However, all the different datatypes in the df change to object. Sending all these objects to Elastic using eland.pandas_to_eland() still works, but since all the df columns are of type object, everything in Elastic becomes keyword. As a workaround I'm now using the es_type_overrides parameter of pandas_to_eland() to override the types, but I'm wondering if there is a cleaner solution.
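
A dtype-preserving alternative is to skip the replace entirely and map missing values to None only while generating the documents, e.g. when feeding the bulk helper of the plain elasticsearch client. This is a sketch, not eland-specific; the index name 'meters' and the local cluster URL are assumptions:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch('http://localhost:9200')  # assumed local cluster

def actions(df, index_name, cols_with_null_value):
    # df is never mutated, so its dtypes stay intact
    for _, row in df.iterrows():
        doc = {k: (None if pd.isna(v) else v)
               for k, v in row.items()
               if k in cols_with_null_value or not pd.isna(v)}
        yield {'_index': index_name, '_source': doc}

helpers.bulk(es, actions(df, 'meters', cols_with_null_value))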
