
I have a pandas DataFrame containing Windows 10 logs that I want to convert to JSON. What is an efficient way to do this?

I have already managed to generate the default (flat) JSON from the DataFrame, but it is not nested. This is what I currently get:

{
    "0": {
        "ProcessName": "Firefox",
        "time": "2019-07-12T00:00:00",
        "timeFloat": 1562882400.0,
        "internal_time": 0.0,
        "counter": 0
    },
    "1": {
        "ProcessName": "Excel",
        "time": "2019-07-12T00:00:00",
        "timeFloat": 1562882400.0,
        "internal_time": 0.0,
        "counter": 0
    },
    "2": {
        "ProcessName": "Word",
        "time": "2019-07-12T01:30:00",
        "timeFloat": 1562888000.0,
        "internal_time": 1.5533333333,
        "counter": 0
    }
}

I want it to look like this, where each process name maps to its "counter" value:

{
    "0": {
        "time": "2019-07-12T00:00:00",
        "timeFloat": 1562882400.0,
        "internal_time": 0.0,
        "Processes": {
            "Firefox": 0,
            "Excel": 0
        }
    },
    "1": ...
}
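
For reference, the flat form above is what pandas produces out of the box; a minimal sketch, with the example rows hard-coded:

import json

import pandas as pd

# example rows matching the flat output above
df = pd.DataFrame({
    "ProcessName": ["Firefox", "Excel", "Word"],
    "time": ["2019-07-12T00:00:00", "2019-07-12T00:00:00", "2019-07-12T01:30:00"],
    "timeFloat": [1562882400.0, 1562882400.0, 1562888000.0],
    "internal_time": [0.0, 0.0, 1.5533333333],
    "counter": [0, 0, 0],
})

# orient="index" keys the JSON by the row index ("0", "1", ...) but stays flat
flat = json.loads(df.to_json(orient="index"))
print(json.dumps(flat, indent=4))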

2 Answers


It seems to me that you want to create JSON from data aggregated on ['time', 'timeFloat', 'internal_time'], which you can get with:

df.groupby(['time', 'timeFloat', 'internal_time'])

However, your example suggests that you want to keep the index keys ("0", "1", etc.), which runs contrary to that intention.

The aggregated values from one time point:

"Firefox" : 0
"Excel" : 0 

seem to correspond to those index keys, which will be lost when you do the aggregation.
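
If keeping those labels matters, one workaround (a sketch using the df defined in the code below; df_keep is just an illustrative name) is to copy the index into a column before aggregating, so it survives the groupby as a list per group:

# sketch: preserve the original row labels through the aggregation
df_keep = df.reset_index()  # moves the "0"/"1"/"2" index into a column named "index"
ddf = df_keep.groupby(['time', 'timeFloat', 'internal_time'], as_index=False).agg(list)
# each aggregated row now also carries the original labels as a list, e.g. ['0', '1']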

However, if you decide to use aggregation, the code would look something like this:

# reading in data:

import pandas as pd
import json
json_data = {
    "0": {
        "ProcessName": "Firefox",
        "time": "2019-07-12T00:00:00",
        "timeFloat": 1562882400.0,
        "internal_time": 0.0,
        "counter": 0
    },
    "1": {
        "ProcessName": "Excel",
        "time": "2019-07-12T00:00:00",
        "timeFloat": 1562882400.0,
        "internal_time": 0.0,
        "counter": 0
    },
    "2": {
        "ProcessName": "Word",
        "time": "2019-07-12T01:30:00",
        "timeFloat": 1562888000.0,
        "internal_time": 1.5533333333,
        "counter": 0
}}

df = pd.DataFrame.from_dict(json_data)
df = df.T  # rows keyed by "0", "1", ...; no set_index needed, groupby works on the columns

# processing:
ddf = df.groupby(['time', 'timeFloat', 'internal_time'], as_index=False).agg(list)
ddf['Processes'] = ddf.apply(lambda r: dict(zip(r['ProcessName'], r['counter'])), axis=1)
ddf = ddf.drop(['ProcessName', 'counter'], axis=1)

# printing the result:
json2 = json.loads(ddf.to_json(orient="records"))
print(json.dumps(json2, indent=4, sort_keys=True))

Result:

[
    {
        "Processes": {
            "Excel": 0,
            "Firefox": 0
        },
        "internal_time": 0.0,
        "time": "2019-07-12T00:00:00",
        "timeFloat": 1562882400.0
    },
    {
        "Processes": {
            "Word": 0
        },
        "internal_time": 1.5533333333,
        "time": "2019-07-12T01:30:00",
        "timeFloat": 1562888000.0
    }
]
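
Note that orient="records" produces a JSON array rather than the index-keyed object shown in the question. If you want the "0", "1" keys back, one option (a sketch, assuming the same ddf as above) is orient="index":

# re-key by the row index to match the question's desired shape
json3 = json.loads(ddf.to_json(orient="index"))
print(json.dumps(json3, indent=4, sort_keys=True))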



As I understand it, you need to group objects by "time" and merge the counters from different processes. If so, here is an example implementation:

import json

input_data = {
    "0": {
        "ProcessName": "Firefox",
        "time": "2019-07-12T00:00:00",
        "timeFloat": 1562882400.0,
        "internal_time": 0.0,
        "counter": 0
    },
    "2": {
        "ProcessName": "ZXC",
        "time": "2019-07-12T00:00:00",
        "timeFloat": 1562882400.0,
        "internal_time": 0.0,
        "counter": 0
    },
    "3": {
        "ProcessName": "QWE",
        "time": "else_time",
        "timeFloat": 1562882400.0,
        "internal_time": 0.0,
        "counter": 0
    }
}


def group_input_data_by_time(dict_data):
    time_data = {}
    for value_dict in dict_data.values():
        counter = value_dict["counter"]
        process_name = value_dict["ProcessName"]
        time_ = value_dict["time"]
        common_data = {
            "time": time_,
            "timeFloat": value_dict["timeFloat"],
            "internal_time": value_dict["internal_time"],
        }
        # setdefault returns the entry already stored for this time if there
        # is one, otherwise it stores and returns common_data
        common_data = time_data.setdefault(time_, common_data)
        processes = common_data.setdefault("Processes", {})
        processes[process_name] = counter

    # if required to change keys from time to enumerated
    result_dict = {}
    for ind, value in enumerate(time_data.values()):
        result_dict[str(ind)] = value

    return result_dict


print(json.dumps(group_input_data_by_time(input_data), indent=4))

Result is:

{
    "0": {
        "time": "2019-07-12T00:00:00",
        "timeFloat": 1562882400.0,
        "internal_time": 0.0,
        "Processes": {
            "Firefox": 0,
            "ZXC": 0
        }
    },
    "1": {
        "time": "else_time",
        "timeFloat": 1562882400.0,
        "internal_time": 0.0,
        "Processes": {
            "QWE": 0
        }
    }
}

1 Comment

This does not take advantage of the fact that the data is in a pandas DataFrame; it will scale worse with more data than a pandas-based solution.
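
For comparison, the same grouping can be expressed with a vectorized groupby; a minimal sketch, assuming the data has been loaded into a DataFrame df as in the first answer:

keys = ['time', 'timeFloat', 'internal_time']
# sort=False keeps first-appearance order, matching the pure-Python function
result = {
    str(i): {**dict(zip(keys, group_key)),
             "Processes": dict(zip(grp['ProcessName'], grp['counter']))}
    for i, (group_key, grp) in enumerate(df.groupby(keys, sort=False))
}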
