I have a JSON data file, raw.json, with one JSON object per line:

{"time": 12.640, "name": "machine1", "value": 24.0}
{"time": 12.645, "name": "machine2", "value": 0.0}
{"time": 12.65002, "name": "machine3", "value": true}
{"time": 12.66505, "name": "machine4", "value": 1.345}
{"time": 12.67007, "name": "machine5", "value": 5.068}
{"time": 12.67508, "name": "machine4", "value": 1.075}
{"time": 12.6801, "name": "machine5", "value": 2.0868}
{"time": 12.6851, "name": "machine4", "value": 0.0}
{"time": 12.6901, "name": "machine5", "value": 12.633}
{"time": 12.69512, "name": "machine5", "value": 13.13}
{"time": 12.70013, "name": "machine3", "value": false}
{"time": 12.70515, "name": "machine3", "value": false}
{"time": 12.71016, "name": "machine3", "value": false}
{"time": 12.71517, "name": "machine5", "value": 131.633}

In my Python script I can read the file line by line and build a list:

import json

data = []
# The with block closes the file automatically
with open('raw.json') as f:
    for line in f:
        data.append(json.loads(line))

for idx, row in enumerate(data):
    data_list = idx + 1, row['time'], row['name'], row['value']
    print data_list

output:

(1, 12.64, u'machine1', 24.0)
(2, 12.645, u'machine2', 0.0)
(3, 12.65002, u'machine3', True)
(4, 12.66505, u'machine4', 1.345)
(5, 12.67007, u'machine5', 5.068)
(6, 12.67508, u'machine4', 1.075)
(7, 12.6801, u'machine5', 2.0868)
(8, 12.6851, u'machine4', 0.0)
(9, 12.6901, u'machine5', 12.633)
(10, 12.69512, u'machine5', 13.13)
(11, 12.70013, u'machine3', False)
(12, 12.70515, u'machine3', False)
(13, 12.71016, u'machine3', False)
(14, 12.71517, u'machine5', 131.633)

I want to group this data by machine so that I have individual lists (arrays) that I can use, e.g.:

machine1 = [12.640, 24.0];
machine2 = [12.645, 0.0];
machine3 = [
12.65002, true
12.70013, false
12.70515, false
12.71016, false
];
machine4 = [
12.66505, 1.345
12.67508, 1.075
12.6851, 0.0
];

and so on. In addition, how can I search this tuple or the list directly to generate metadata such as the sum or average for machine1, machine2, etc.?

Sum_Machine1 = 24;
Sum_Machine2 = 0;....
  • I tried searching with [x[2] for x in data_list].index('machine1') and also [item for item in data_list if 0 in item] (to find where the values are zero); I did not even get to try searching for the string. Commented Feb 21, 2014 at 2:16
  • I also tried [i for i, v in enumerate(data_list) if v[2] == 'machine1']. Commented Feb 21, 2014 at 2:22
  • Check out pandas if you mean to do a lot of this stuff. Commented Feb 21, 2014 at 3:06

2 Answers

First Solution

Here is how I approach the problem:

import json
import collections

if __name__ == '__main__':    
    # Load file into data
    with open('raw.json') as f:
        data = [json.loads(line) for line in f]

    # Calculate count and total
    time_total = collections.defaultdict(float)
    time_count = collections.defaultdict(int)
    for row in data:
        time_count[row['name']] += 1
        time_total[row['name']] += row['time']

    # Calculate average
    time_average = {}
    for name in time_count:
        time_average[name] = time_total[name] / time_count[name]

    # Report
    for name in sorted(time_count):
        print '{:<10} {:2} {:8.2f} {:8.2f}'.format(
            name,
            time_count[name],
            time_total[name],
            time_average[name])

Discussion

  • data is a list of dicts with keys such as name, time, ...
  • I used three additional dictionaries to keep track of the count, total, and average per machine.
  • I assume you want your calculation based on the time value. If not, it is an easy fix: use row['value'] instead of row['time'].
  • defaultdict is a convenient way to tally numbers: if a key does not exist yet, it is created with a default value (0 for int, 0.0 for float) the first time it is accessed. It is worth looking up.
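If the totals should be based on value rather than time, the same defaultdict idea can also produce the per-machine lists the question asks for. The following is only a minimal sketch: a few rows from raw.json are inlined so it runs on its own, the names readings_by_machine and value_total are illustrative, and it assumes boolean readings (such as machine3's) should be left out of the sums. It uses print() with a single argument so that it runs on Python 3 as well.

```python
import collections

# A few rows from raw.json, inlined so the sketch is self-contained.
data = [
    {"time": 12.640, "name": "machine1", "value": 24.0},
    {"time": 12.65002, "name": "machine3", "value": True},
    {"time": 12.66505, "name": "machine4", "value": 1.345},
    {"time": 12.67508, "name": "machine4", "value": 1.075},
    {"time": 12.6851, "name": "machine4", "value": 0.0},
]

# Group (time, value) pairs by machine name, in file order.
readings_by_machine = collections.defaultdict(list)
for row in data:
    readings_by_machine[row["name"]].append((row["time"], row["value"]))

# Sum the numeric values per machine; booleans are skipped (an assumption).
value_total = {}
for name, readings in readings_by_machine.items():
    numeric = [v for _, v in readings if not isinstance(v, bool)]
    value_total[name] = sum(numeric)

for name in sorted(value_total):
    print("%-10s %8.3f" % (name, value_total[name]))
```

Keeping everything in one dictionary keyed by machine name, rather than in separate variables such as machine1 and machine2, means new machines in the file need no code changes.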

Second Solution

Here is a different approach: since your data looks like a table, why not use a database to handle it? The advantage of this approach is that you don't have to do the calculations yourself.

import json
import sqlite3

if __name__ == '__main__':
    # Create an in-memory database for calculation
    connection = sqlite3.connect(':memory:')
    cursor = connection.cursor()
    cursor.execute('DROP TABLE IF EXISTS time_table')
    cursor.execute('CREATE TABLE time_table (name text, time real)')
    connection.commit()

    # Load file into database
    with open('raw.json') as f:
        for line in f:
            row = json.loads(line)
            cursor.execute('INSERT INTO time_table VALUES (?,?)', (row['name'], row['time']))
    # Commit once, after all rows are inserted
    connection.commit()

    # Report: print the name, count, sum, and average
    cursor.execute('SELECT name, COUNT(time), SUM(time), AVG(time) FROM time_table GROUP BY name')
    print '%-10s %8s %8s %8s' % ('NAME', 'COUNT', 'SUM', 'AVERAGE')
    for row in cursor.fetchall():
        print '%-10s %8d %8.2f %8.2f' % row

    connection.close()

Output

NAME          COUNT      SUM  AVERAGE
machine1          1    12.64    12.64
machine2          1    12.64    12.64
machine3          4    50.77    12.69
machine4          3    38.03    12.68
machine5          5    63.45    12.69

Discussion

  • In this solution, I created an in-memory SQLite3 database.
  • Since we are only interested in the name and time columns, the table only contains those two.
  • We got all the statistical functions such as SUM, COUNT, and AVG for free, just by using the database.
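If the statistics are wanted for the value column instead, the table can carry that column too. This is a hedged sketch rather than part of the solution above: the table name reading_table is made up for illustration, a few rows are inlined so it runs without raw.json, and boolean readings are skipped on the assumption that they should not be counted as 1 or 0.

```python
import sqlite3

# A few rows from raw.json, inlined so the sketch is self-contained.
rows = [
    {"time": 12.67007, "name": "machine5", "value": 5.068},
    {"time": 12.6801, "name": "machine5", "value": 2.0868},
    {"time": 12.65002, "name": "machine3", "value": True},
    {"time": 12.66505, "name": "machine4", "value": 1.345},
]

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE reading_table (name text, time real, value real)")

# Skip boolean readings (an assumption) so they do not count as 1/0.
numeric_rows = [
    (r["name"], r["time"], r["value"])
    for r in rows
    if not isinstance(r["value"], bool)
]
cursor.executemany("INSERT INTO reading_table VALUES (?,?,?)", numeric_rows)
connection.commit()

# Per-machine count, sum, and average of the value column.
cursor.execute(
    "SELECT name, COUNT(value), SUM(value), AVG(value) "
    "FROM reading_table GROUP BY name ORDER BY name"
)
summary = cursor.fetchall()
for row in summary:
    print("%-10s %8d %8.3f %8.3f" % row)

connection.close()
```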

Addition to First Solution

To answer the question: Given machine5, how can I get the last value? By that, I assume you want to filter your data down to those containing machine5, then sort them by time and select the last row. For the first solution, append the following block of code and run it:

# Filter data: prints all rows with 'machine5'
print '\nFilter by machine5'
machine5 = [row for row in data if row['name'] == 'machine5']
machine5 = sorted(machine5, key=lambda row: row['time'])
pprint(machine5)

# Get the last instance
print '\nLast instance of machine5:'
latest_row = machine5[-1]
pprint(latest_row)

Don't forget to add the following at the beginning of the script:

from pprint import pprint

Output

Filter by machine5
[{u'name': u'machine5', u'time': 12.67007, u'value': 5.068},
 {u'name': u'machine5', u'time': 12.6801, u'value': 2.0868},
 {u'name': u'machine5', u'time': 12.6901, u'value': 12.633},
 {u'name': u'machine5', u'time': 12.69512, u'value': 13.13},
 {u'name': u'machine5', u'time': 12.71517, u'value': 131.633}]

Last instance of machine5:
{u'name': u'machine5', u'time': 12.71517, u'value': 131.633}

Discussion

If you do not want to sort the rows by time, then remove the sorted() line and that will give you the unsorted output.
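When only the latest reading or the total of the value field is needed, a full sort is not required. The sketch below is one possible way to do it, under the assumption that "last" means the row with the largest time; the rows are inlined (and deliberately out of order) so that it runs on its own, and value_sum is an illustrative name.

```python
# A few rows, inlined and out of time order on purpose.
data = [
    {"time": 12.67007, "name": "machine5", "value": 5.068},
    {"time": 12.66505, "name": "machine4", "value": 1.345},
    {"time": 12.71517, "name": "machine5", "value": 131.633},
    {"time": 12.6801, "name": "machine5", "value": 2.0868},
]

machine5 = [row for row in data if row["name"] == "machine5"]

# max() with a key finds the latest row without sorting the whole list.
latest_row = max(machine5, key=lambda row: row["time"])
print("%s %s" % (latest_row["time"], latest_row["value"]))

# Total of the value field for this machine.
value_sum = sum(row["value"] for row in machine5)
print(value_sum)
```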

5 Comments

If I still wanted individual arrays or tables for each machine, i.e. machine1 = [12.640, 24.0]; machine2 = [12.645, 0.0]; machine3 = [ 12.65002,true 12.70013,false 12.70515,false 12.71016,false ]; machine4 = [ 12.66505, 1.345 12.67508, 1.075 12.6851, 0.0 ]; how would you go about it? Is there a benefit to creating an SQL DB versus using a collection in each case?
I highly recommend against having separate variables to store data like that. It makes computation much harder than it should be.
What would be your suggestion for searching the dictionary for, let's say, the last known value and timestamp of machine5, i.e. timestamp = 12.71517 and value = 131.633? Dictionaries are not ordered by value, but my goal is to retrieve the last "value" of a particular key ("machine1").
Please see the Addition to First Solution section I just added.
Once I was able to construct the list machine5 = [row for row in data if row['name'] == 'machine5'], I get all the rows for that machine, but what if I want to total the values, i.e. 5.068+2.0868+12.633+13.13+131.633?

Make each row an instance of a class (not strictly necessary, but nice), define a comparison function, and use sort:

class MachineInfo:

    def __init__(self, info_time, name, value):
        self.info_time = info_time
        self.name = name
        self.value = value

def cmp_machines(a, b):
    return cmp(a.name, b.name)

sort also takes an optional comparison function:

info = [... fill this with MachineInfo instances here ...]

# then call 
info = sorted(info, cmp_machines)

# or to sort in place
info.sort(cmp_machines)

# alternatively add a  __cmp__ method to MachineInfo and that will get used by default

There are fancier ways of doing it (see https://wiki.python.org/moin/HowTo/Sorting), but it's nice to keep things simple and obvious.
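One of those alternatives is worth sketching, because the cmp argument to sorted() and list.sort() was removed in Python 3, while a key function works in both Python 2 and 3. The example below is a minimal illustration that reuses the MachineInfo class from above with a few sample rows; sorting by name and then by time is an assumption about the desired order.

```python
from operator import attrgetter


class MachineInfo(object):
    def __init__(self, info_time, name, value):
        self.info_time = info_time
        self.name = name
        self.value = value


# Sample rows taken from the question's data.
info = [
    MachineInfo(12.66505, "machine4", 1.345),
    MachineInfo(12.640, "machine1", 24.0),
    MachineInfo(12.67007, "machine5", 5.068),
    MachineInfo(12.67508, "machine4", 1.075),
]

# Sort by name, then by time within each name.
info_sorted = sorted(info, key=attrgetter("name", "info_time"))
for item in info_sorted:
    print("%s %s %s" % (item.name, item.info_time, item.value))
```

If an existing comparison function has to be kept, functools.cmp_to_key can wrap it so that it can be passed as key.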
