
I have a pandas DataFrame containing Windows 10 logs that I want to convert to JSON. What is an efficient way to do this?

I have already managed to generate the default (flat) JSON from the DataFrame, but it is not nested. This is what I currently get:

{
    "0": {
        "ProcessName": "Firefox",
        "time": "2019-07-12T00:00:00",
        "timeFloat": 1562882400.0,
        "internal_time": 0.0,
        "counter": 0
    },
    "1": {
        "ProcessName": "Excel",
        "time": "2019-07-12T00:00:00",
        "timeFloat": 1562882400.0,
        "internal_time": 0.0,
        "counter": 0
    },
    "2": {
        "ProcessName": "Word",
        "time": "2019-07-12T01:30:00",
        "timeFloat": 1562888000.0,
        "internal_time": 1.5533333333,
        "counter": 0
    }
}

I want it to look like this, where each process name maps to its "counter" value:

{
    "0": {
        "time": "2019-07-12T00:00:00",
        "timeFloat": 1562882400.0,
        "internal_time": 0.0,
        "Processes": {
            "Firefox": 0,
            "Excel": 0
        }
    },
    "1": ...
}
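
For reference, the flat form above is what pandas produces out of the box; a minimal sketch, with the example rows hard-coded:

import json

import pandas as pd

# example rows matching the flat output above
df = pd.DataFrame({
    "ProcessName": ["Firefox", "Excel", "Word"],
    "time": ["2019-07-12T00:00:00", "2019-07-12T00:00:00", "2019-07-12T01:30:00"],
    "timeFloat": [1562882400.0, 1562882400.0, 1562888000.0],
    "internal_time": [0.0, 0.0, 1.5533333333],
    "counter": [0, 0, 0],
})

# orient="index" keys the JSON by the row index ("0", "1", ...) but stays flat
flat = json.loads(df.to_json(orient="index"))
print(json.dumps(flat, indent=4))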

2 Answers


It seems to me that you want to create JSON from data aggregated on ['time', 'timeFloat', 'internal_time'], which you can get with:

df.groupby(['time', 'timeFloat', 'internal_time'])

However, your example suggests that you want to keep the index keys ("0", "1", etc.), which runs contrary to that intention.

The aggregated values from one time point:

"Firefox" : 0
"Excel" : 0 

seem to correspond to those index keys, which will be lost when you do the aggregation.
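
If keeping those labels matters, one workaround (a sketch using the df defined in the code below; df_keep is just an illustrative name) is to copy the index into a column before aggregating, so it survives the groupby as a list per group:

# sketch: preserve the original row labels through the aggregation
df_keep = df.reset_index()  # moves the "0"/"1"/"2" index into a column named "index"
ddf = df_keep.groupby(['time', 'timeFloat', 'internal_time'], as_index=False).agg(list)
# each aggregated row now also carries the original labels as a list, e.g. ['0', '1']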

However, if you decide to use aggregation, the code would look something like this:

# reading in data:

import pandas as pd
import json
json_data = {
    "0": {
        "ProcessName": "Firefox",
        "time": "2019-07-12T00:00:00",
        "timeFloat": 1562882400.0,
        "internal_time": 0.0,
        "counter": 0
    },
    "1": {
        "ProcessName": "Excel",
        "time": "2019-07-12T00:00:00",
        "timeFloat": 1562882400.0,
        "internal_time": 0.0,
        "counter": 0
    },
    "2": {
        "ProcessName": "Word",
        "time": "2019-07-12T01:30:00",
        "timeFloat": 1562888000.0,
        "internal_time": 1.5533333333,
        "counter": 0
}}

df = pd.DataFrame.from_dict(json_data)
df = df.T  # rows keyed by "0", "1", ...; no set_index needed, groupby works on the columns

# processing:
ddf = df.groupby(['time', 'timeFloat', 'internal_time'], as_index=False).agg(list)
ddf['Processes'] = ddf.apply(lambda r: dict(zip(r['ProcessName'], r['counter'])), axis=1)
ddf = ddf.drop(['ProcessName', 'counter'], axis=1)

# printing the result:
json2 = json.loads(ddf.to_json(orient="records"))
print(json.dumps(json2, indent=4, sort_keys=True))

Result:

[
    {
        "Processes": {
            "Excel": 0,
            "Firefox": 0
        },
        "internal_time": 0.0,
        "time": "2019-07-12T00:00:00",
        "timeFloat": 1562882400.0
    },
    {
        "Processes": {
            "Word": 0
        },
        "internal_time": 1.5533333333,
        "time": "2019-07-12T01:30:00",
        "timeFloat": 1562888000.0
    }
]
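
Note that orient="records" produces a JSON array rather than the index-keyed object shown in the question. If you want the "0", "1" keys back, one option (a sketch, assuming the same ddf as above) is orient="index":

# re-key by the row index to match the question's desired shape
json3 = json.loads(ddf.to_json(orient="index"))
print(json.dumps(json3, indent=4, sort_keys=True))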



As I understand it, you need to group objects by "time" and merge the counters from different processes. If so, here is an example implementation:

import json

input_data = {
    "0": {
        "ProcessName": "Firefox",
        "time": "2019-07-12T00:00:00",
        "timeFloat": 1562882400.0,
        "internal_time": 0.0,
        "counter": 0
    },
    "2": {
        "ProcessName": "ZXC",
        "time": "2019-07-12T00:00:00",
        "timeFloat": 1562882400.0,
        "internal_time": 0.0,
        "counter": 0
    },
    "3": {
        "ProcessName": "QWE",
        "time": "else_time",
        "timeFloat": 1562882400.0,
        "internal_time": 0.0,
        "counter": 0
    }
}


def group_input_data_by_time(dict_data):
    time_data = {}
    for value_dict in dict_data.values():
        counter = value_dict["counter"]
        process_name = value_dict["ProcessName"]
        time_ = value_dict["time"]
        common_data = {
            "time": time_,
            "timeFloat": value_dict["timeFloat"],
            "internal_time": value_dict["internal_time"],
        }
        # setdefault returns the entry already stored for this time if there
        # is one, otherwise it stores and returns common_data
        common_data = time_data.setdefault(time_, common_data)
        processes = common_data.setdefault("Processes", {})
        processes[process_name] = counter

    # if required to change keys from time to enumerated
    result_dict = {}
    for ind, value in enumerate(time_data.values()):
        result_dict[str(ind)] = value

    return result_dict


print(json.dumps(group_input_data_by_time(input_data), indent=4))

Result is:

{
    "0": {
        "time": "2019-07-12T00:00:00",
        "timeFloat": 1562882400.0,
        "internal_time": 0.0,
        "Processes": {
            "Firefox": 0,
            "ZXC": 0
        }
    },
    "1": {
        "time": "else_time",
        "timeFloat": 1562882400.0,
        "internal_time": 0.0,
        "Processes": {
            "QWE": 0
        }
    }
}

1 Comment

This does not take advantage of the fact that the data is in a pandas DataFrame; it will scale worse with more data than a pandas-based solution.
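
For comparison, the same grouping can be expressed with a vectorized groupby; a minimal sketch, assuming the data has been loaded into a DataFrame df as in the first answer:

keys = ['time', 'timeFloat', 'internal_time']
# sort=False keeps first-appearance order, matching the pure-Python function
result = {
    str(i): {**dict(zip(keys, group_key)),
             "Processes": dict(zip(grp['ProcessName'], grp['counter']))}
    for i, (group_key, grp) in enumerate(df.groupby(keys, sort=False))
}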
