
I'm building a process to "outer join" two CSV files and export the result as a JSON object.

import pandas

# read the source csv files
firstcsv = pandas.read_csv('file1.csv', names=['main_index', 'attr_one', 'attr_two'])
secondcsv = pandas.read_csv('file2.csv', names=['main_index', 'attr_three', 'attr_four'])

# merge them
output = firstcsv.merge(secondcsv, on='main_index', how='outer')

jsonresult = output.to_json(orient='records')
print(jsonresult)

Now, the two csv files are like this:

file1.csv:
1, aurelion, sol
2, lee, sin
3, cute, teemo

file2.csv:
1, midlane, mage
2, jungler, melee

I would like the resulting JSON to look like this:

[{"main_index":1,"attr_one":"aurelion","attr_two":"sol","attr_three":"midlane","attr_four":"mage"},
{"main_index":2,"attr_one":"lee","attr_two":"sin","attr_three":"jungler","attr_four":"melee"},
{"main_index":3,"attr_one":"cute","attr_two":"teemo"}]

Instead, for the row with main_index = 3, I'm getting:

{"main_index":3,"attr_one":"cute","attr_two":"teemo","attr_three":null,"attr_four":null}]

So nulls are added automatically in the output. I would like to remove them. I looked around but couldn't find a proper way to do it.

Hope someone can help!

2 Answers


Because the outer join produces a DataFrame, pandas fills the cells that have no match with NaN, i.e.

>>> print(output)
   main_index   attr_one attr_two attr_three attr_four
0           1   aurelion      sol    midlane      mage
1           2        lee      sin    jungler     melee
2           3       cute    teemo        NaN       NaN

I can't see any option in the DataFrame.to_json documentation to skip null values: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_json.html

So the approach I came up with involves rebuilding the JSON string. This probably isn't very performant for large datasets with millions of rows (but there are fewer than 200 champs in League, so it shouldn't be a huge issue!)

from collections import OrderedDict
import json

jsonresult = output.to_json(orient='records')
# read the json string to get a list of dictionaries
rows = json.loads(jsonresult)

# new_rows = [
#     # rebuild the dictionary for each row, only including non-null values
#     {key: val for key, val in row.items() if pandas.notnull(val)}
#     for row in rows
# ]

# to maintain order use Ordered Dict
new_rows = [
    OrderedDict([
        (key, row[key]) for key in output.columns
        if (key in row) and pandas.notnull(row[key])
    ])
    for row in rows
]

new_json_output = json.dumps(new_rows)

And you will find that new_json_output has dropped all keys that have NaN values, and kept the order:

>>> print(new_json_output)
[{"main_index": 1, "attr_one": " aurelion", "attr_two": " sol", "attr_three": " midlane", "attr_four": " mage"},
 {"main_index": 2, "attr_one": " lee", "attr_two": " sin", "attr_three": " jungler", "attr_four": " melee"},
 {"main_index": 3, "attr_one": " cute", "attr_two": " teemo"}]
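
The leading spaces in the values above (" aurelion", " sol") come from the space after each comma in the sample CSV files. If those spaces are not wanted, one option is the skipinitialspace parameter of read_csv. The following is a minimal sketch, assuming the file contents shown in the question; io.StringIO stands in for file1.csv so the snippet is self-contained.

```python
import io

import pandas

# Stand-in for file1.csv from the question (note the space after each comma).
file1 = io.StringIO("1, aurelion, sol\n2, lee, sin\n3, cute, teemo\n")

# skipinitialspace=True tells the parser to drop spaces that follow a delimiter.
firstcsv = pandas.read_csv(
    file1,
    names=['main_index', 'attr_one', 'attr_two'],
    skipinitialspace=True,
)

print(firstcsv['attr_one'].tolist())
```

With this option the string columns should no longer carry the leading space, so the JSON values match the output the question asks for.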

3 Comments

This works, but I lose the order of the elements (say I specified a custom order with the reindex_axis method). I guess I need to use some sort of OrderedDict to keep the ordering...
I did just find it last night... but thanks so much for the help anyway!
if (key in row) and pandas.notnull(row[key]) won't work if a field is not a scalar. Instead use (after import collections.abc): if (key in row) and (True if isinstance(row[key], collections.abc.Sequence) else pandas.notnull(row[key]))
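
On the ordering and non-scalar points raised in these comments, here is a hedged sketch of a more compact variant. It assumes Python 3.7 or later, where plain dicts keep insertion order, so OrderedDict is not needed. It also relies on the fact that json.loads turns JSON null into None, so an "is not None" check is enough and does not trip over list-valued fields. The function name records_without_nulls and the small sample frames are made up for illustration.

```python
import json

import pandas


def records_without_nulls(df):
    """Return a JSON array string in which each record omits its null fields."""
    # to_json writes NaN as null; json.loads then reads null back as None.
    rows = json.loads(df.to_json(orient='records'))
    cleaned = [
        # Keys keep the DataFrame column order (dicts are ordered in Python 3.7+).
        {key: val for key, val in row.items() if val is not None}
        for row in rows
    ]
    return json.dumps(cleaned)


# Illustrative data, shaped like the frames in the question.
first = pandas.DataFrame({'main_index': [1, 2, 3], 'attr_one': ['aurelion', 'lee', 'cute']})
second = pandas.DataFrame({'main_index': [1, 2], 'attr_three': ['midlane', 'jungler']})
merged = first.merge(second, on='main_index', how='outer')

result = records_without_nulls(merged)
print(result)
```

This only strips top-level nulls; a None nested inside a list or dict value is left untouched.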

I was trying to achieve the same thing and found the following solution, that I think should be pretty fast (although I haven't tested that). A bit too late to answer the original question, but maybe useful to some.

import pandas as pd

# Data
df = pd.DataFrame([
    {"main_index":1,"attr_one":"aurelion","attr_two":"sol","attr_three":"midlane","attr_four":"mage"},
    {"main_index":2,"attr_one":"lee","attr_two":"sin","attr_three":"jungler","attr_four":"melee"},
    {"main_index":3,"attr_one":"cute","attr_two":"teemo"}
])

This gives a DataFrame with missing values.

>>> print(df)
  attr_four  attr_one attr_three attr_two  main_index
0      mage  aurelion    midlane      sol           1
1     melee       lee    jungler      sin           2
2       NaN      cute        NaN    teemo           3

To convert it to JSON, you can apply to_json() to each column of the transposed DataFrame (that is, each original row), after filtering out empty values. Then join the resulting JSON strings, separated by commas, and wrap them in brackets.

# To JSON
json_df = df.T.apply(lambda row: row[~row.isnull()].to_json())
json_wrapped = "[%s]" % ",".join(json_df)

Then

>>> print(json_wrapped)
[{"attr_four":"mage","attr_one":"aurelion","attr_three":"midlane","attr_two":"sol","main_index":1},{"attr_four":"melee","attr_one":"lee","attr_three":"jungler","attr_two":"sin","main_index":2},{"attr_one":"cute","attr_two":"teemo","main_index":3}]
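
The same idea can also be written without the transpose, by applying along axis=1 and using dropna(). This is a sketch rather than a tested-for-speed alternative; the explicit columns argument is an assumption added here to pin the key order, and the data is a shortened version of the example above.

```python
import json

import pandas as pd

# Shortened sample data; columns= fixes the column (and therefore key) order.
df = pd.DataFrame(
    [
        {"main_index": 1, "attr_one": "aurelion", "attr_three": "midlane"},
        {"main_index": 2, "attr_one": "lee", "attr_three": "jungler"},
        {"main_index": 3, "attr_one": "cute"},
    ],
    columns=["main_index", "attr_one", "attr_three"],
)

# For each row, drop the missing values and serialize what is left.
row_json = df.apply(lambda row: row.dropna().to_json(), axis=1)

# Join the per-row JSON objects into a single JSON array string.
json_wrapped = "[%s]" % ",".join(row_json)

# Parse it back to confirm the string is valid JSON.
parsed = json.loads(json_wrapped)
print(json_wrapped)
```

Parsing the result back with json.loads is a cheap way to check that the hand-assembled string is valid JSON.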
