Python - Pandas - How to drop null values from to_json after dataframe merge

Question

I'm building a process to "outer join" two CSV files and export the result as a JSON object.

import pandas

# read the source csv files
firstcsv = pandas.read_csv('file1.csv', names=['main_index', 'attr_one', 'attr_two'])
secondcsv = pandas.read_csv('file2.csv', names=['main_index', 'attr_three', 'attr_four'])

# merge them
output = firstcsv.merge(secondcsv, on='main_index', how='outer')

jsonresult = output.to_json(orient='records')
print(jsonresult)

The two CSV files look like this:

file1.csv:
1, aurelion, sol
2, lee, sin
3, cute, teemo

file2.csv:
1, midlane, mage
2, jungler, melee

I would like the resulting JSON to look like this:

[{"main_index":1,"attr_one":"aurelion","attr_two":"sol","attr_three":"midlane","attr_four":"mage"},
{"main_index":2,"attr_one":"lee","attr_two":"sin","attr_three":"jungler","attr_four":"melee"},
{"main_index":3,"attr_one":"cute","attr_two":"teemo"}]

Instead, for the row with main_index = 3, I'm getting:

{"main_index":3,"attr_one":"cute","attr_two":"teemo","attr_three":null,"attr_four":null}]

So nulls are added automatically to the output. I would like to remove them. I looked around but couldn't find a proper way to do it.

Hope someone can help!

Hazzles · Accepted Answer · 2017-09-14 00:24:29Z


Since we're using a DataFrame, pandas will 'fill in' values with NaN, i.e.

>>> print(output)
      main_index   attr_one attr_two attr_three attr_four
0           1   aurelion      sol    midlane      mage
1           2        lee      sin    jungler     melee
2           3       cute    teemo        NaN       NaN

I can't see any option in the DataFrame.to_json documentation to skip null values: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_json.html

So the way I came up with involves rebuilding the JSON string. This probably isn't very performant for large datasets of millions of rows (but there are fewer than 200 champs in League, so it shouldn't be a huge issue!)

from collections import OrderedDict
import json
import pandas

jsonresult = output.to_json(orient='records')
# read the json string to get a list of dictionaries
rows = json.loads(jsonresult)

# new_rows = [
#     # rebuild the dictionary for each row, only including non-null values
#     {key: val for key, val in row.items() if pandas.notnull(val)}
#     for row in rows
# ]

# to maintain the column order, use an OrderedDict
new_rows = [
    OrderedDict([
        (key, row[key]) for key in output.columns
        if (key in row) and pandas.notnull(row[key])
    ])
    for row in rows
]

new_json_output = json.dumps(new_rows)

You will find that new_json_output has dropped all keys that have NaN values and kept the column order:

>>> print(new_json_output)
[{"main_index": 1, "attr_one": " aurelion", "attr_two": " sol", "attr_three": " midlane", "attr_four": " mage"},
 {"main_index": 2, "attr_one": " lee", "attr_two": " sin", "attr_three": " jungler", "attr_four": " melee"},
 {"main_index": 3, "attr_one": " cute", "attr_two": " teemo"}]
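The leading spaces in the values above (" aurelion", " sol") come from the spaces after the commas in the CSV files. As a minimal, self-contained sketch of the same round trip, the snippet below reads the sample data from in-memory strings instead of file1.csv and file2.csv and passes skipinitialspace=True to read_csv to strip those spaces. The names FILE1, FILE2, merged and clean_rows are only illustrative, and the plain dict comprehension assumes Python 3.7+, where dicts keep insertion order.

```python
import io
import json

import pandas as pd

# In-memory stand-ins for file1.csv and file2.csv from the question.
FILE1 = "1, aurelion, sol\n2, lee, sin\n3, cute, teemo\n"
FILE2 = "1, midlane, mage\n2, jungler, melee\n"

# skipinitialspace=True drops the blank that follows each comma.
first = pd.read_csv(io.StringIO(FILE1),
                    names=["main_index", "attr_one", "attr_two"],
                    skipinitialspace=True)
second = pd.read_csv(io.StringIO(FILE2),
                     names=["main_index", "attr_three", "attr_four"],
                     skipinitialspace=True)

merged = first.merge(second, on="main_index", how="outer")

# to_json writes NaN as null, which json.loads turns into None.
rows = json.loads(merged.to_json(orient="records"))

# Keep only the keys whose value is not None.
clean_rows = [{key: val for key, val in row.items() if val is not None}
              for row in rows]
clean_json = json.dumps(clean_rows)
print(clean_json)
```

Filtering on "val is not None" is enough here because the values have already gone through JSON, so a missing value can only show up as None.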

edited Sep 14, 2017 at 0:24

answered Sep 13, 2017 at 2:59


3 Comments

Mik1893 Over a year ago

This works, but I lose the order of the elements (let's say I specified a custom order with the reindex_axis method). I guess I need to use some sort of OrderedDict to keep the ordering...

Mik1893 Over a year ago

I just found it last night... but thanks so much for the help anyway!

Hossein Dehno Over a year ago

The check "if (key in row) and pandas.notnull(row[key])" won't work if a field is not a scalar. Instead use: if (key in row) and (isinstance(row[key], collections.abc.Sequence) or pandas.notnull(row[key]))
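To illustrate the point in the last comment: calling pandas.notnull on a list returns an array rather than a single boolean, so it cannot be used directly in an "if". The sketch below is one possible way to guard against that. The helper names is_missing and drop_missing, the sample record and its "tags" field are all hypothetical, and it checks for list, tuple and dict explicitly rather than using collections.abc.Sequence.

```python
import pandas as pd


def is_missing(value):
    """Return True for scalar nulls (None, NaN); containers are never missing."""
    if isinstance(value, (list, tuple, dict)):
        return False
    return bool(pd.isna(value))


def drop_missing(row, columns):
    """Rebuild one record in column order, skipping missing scalar values."""
    return {key: row[key] for key in columns
            if key in row and not is_missing(row[key])}


# Hypothetical record with one list-valued field and one null field.
record = {"main_index": 3, "attr_one": "cute",
          "tags": ["yordle", None], "attr_three": None}
cleaned = drop_missing(record, ["main_index", "attr_one", "tags", "attr_three"])
print(cleaned)
```

Note that nulls nested inside a list are left alone; only top-level keys are dropped.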

knirb · Accepted Answer · 2018-03-20 16:58:57Z

I was trying to achieve the same thing and found the following solution, that I think should be pretty fast (although I haven't tested that). A bit too late to answer the original question, but maybe useful to some.

import pandas as pd

# Data
df = pd.DataFrame([
    {"main_index":1,"attr_one":"aurelion","attr_two":"sol","attr_three":"midlane","attr_four":"mage"},
    {"main_index":2,"attr_one":"lee","attr_two":"sin","attr_three":"jungler","attr_four":"melee"},
    {"main_index":3,"attr_one":"cute","attr_two":"teemo"}
])

This gives a DataFrame with missing values.

>>> print(df)
  attr_four  attr_one attr_three attr_two  main_index
0      mage  aurelion    midlane      sol           1
1     melee       lee    jungler      sin           2
2       NaN      cute        NaN    teemo           3

To convert it to JSON, you can apply to_json() to each row of the transposed DataFrame, after filtering out empty values. Then join the JSON strings, separated by commas, and wrap them in brackets.

# To json
json_df = df.T.apply(lambda row: row[~row.isnull()].to_json())
json_wrapped = "[%s]" % ",".join(json_df)

Then

>>> print(json_wrapped)
[{"attr_four":"mage","attr_one":"aurelion","attr_three":"midlane","attr_two":"sol","main_index":1},{"attr_four":"melee","attr_one":"lee","attr_three":"jungler","attr_two":"sin","main_index":2},{"attr_one":"cute","attr_two":"teemo","main_index":3}]
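A variant of the same idea, sketched here under the assumption that apply with axis=1 is acceptable for your data size, avoids the transpose by applying the function row-wise and using dropna() in place of the boolean mask. The names row_json and records are only illustrative, and the key order in the result depends on the column order your pandas version produces.

```python
import json

import pandas as pd

df = pd.DataFrame([
    {"main_index": 1, "attr_one": "aurelion", "attr_two": "sol",
     "attr_three": "midlane", "attr_four": "mage"},
    {"main_index": 2, "attr_one": "lee", "attr_two": "sin",
     "attr_three": "jungler", "attr_four": "melee"},
    {"main_index": 3, "attr_one": "cute", "attr_two": "teemo"},
])

# axis=1 passes each row as a Series; dropna() removes its null entries
# before the row is serialized on its own.
row_json = df.apply(lambda row: row.dropna().to_json(), axis=1)

# Join the per-row JSON objects into one JSON array.
json_wrapped = "[%s]" % ",".join(row_json)

# Parse it back to confirm the string is valid JSON.
records = json.loads(json_wrapped)
print(json_wrapped)
```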
