I have the following JSON file:
sensorlogs.json
{"arr":[{"UTCTime":10000001,"s1":22,"s2":32,"s3":42,"s4":12},
{"UTCTime":10000002,"s1":23,"s2":33,"s4":13},
{"UTCTime":10000003,"s1":24,"s2":34,"s3":43,"s4":14},
{"UTCTime":10000005,"s1":26,"s2":36,"s3":44,"s4":16},
{"UTCTime":10000006,"s1":27,"s2":37,"s4":17},
{"UTCTime":10000004,"s1":25,"s2":35,"s4":15},
...
{"UTCTime":12345678,"s1":57,"s2":35,"s3":77,"s4":99}
]}
Sensors s1, s2, s3, etc. all transmit at different frequencies (note that s3 transmits every 2 seconds, and timestamps can be out of order).
How can I achieve output like this -
Analyzing s1:
s = [[10000001, 22], [10000002, 23],.. [12345678,57]]
s1 had 2 missing entries
Analyzing s2:
s = [[10000001, 32], [10000002, 33],.. [12345678,35]]
s2 had 0 missing entries
Analyzing s3:
s = [[10000001, 42], [10000003, 43],.. [12345678,77]]
s3 had 0 missing entries
Analyzing s4:
s = [[10000001, 12], [10000003, 13],.. [12345678,99]]
s4 had 1 missing entries
sensorlogs.json is 16 GB.
Missing entries can be detected from the difference between consecutive UTC timestamps, since each sensor transmits at a known frequency.
I cannot hold multiple large arrays in memory due to memory constraints, so I will have to make multiple passes over the same JSON log file, using only a single large array at a time.
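The gap detection described above can be sketched as follows. This is a minimal illustration, assuming the [timestamp, value] pairs are already sorted; countMissing and periodSeconds are names I made up for the example:

```javascript
// Hypothetical helper: counts missing entries in a sorted [timestamp, value]
// series, given the sensor's known transmission period in seconds.
function countMissing(samples, periodSeconds)
{
    var missing = 0;
    for (var i = 1; i < samples.length; i++)
    {
        var delta = samples[i][0] - samples[i - 1][0];
        // Each period-sized step beyond the expected one is a missing entry.
        missing += Math.round(delta / periodSeconds) - 1;
    }
    return missing;
}
```

For example, a sensor with a 2-second period that jumps from 10000001 to 10000005 has skipped one slot, so one missing entry is counted.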
What I have so far is the following -
var fs = require('fs');
var lr = require('readline');
var async = require('async');

var filePath = 'sensorlogs.json';
var tmpObj = {};   //set of keys seen in the log
var arrStrm = [];  //list of sensor keys to validate
var currSensor;
var result = [];
//1. Extract all the keys from the log file
console.log("Extracting keys... \n");
var stream = fs.createReadStream(filePath);
var lineReader = lr.createInterface(
{
    input: stream
});
lineReader.on('line', function (line)
{
    getKeys(line);//extract all the keys from the JSON
});
lineReader.on('close', function()//'close' fires after the last line is read
{
    //obj -> arr
    for(var key in tmpObj)
        arrStrm.push(key);
    //2. Validate individual sensors
    console.log("Validating the sensor data ...\n");
    //Process the sensors one at a time, so only one array is live at once
    async.eachSeries(arrStrm, function(key, done)
    {
        currSensor = key;
        console.log("validating " + currSensor + "...\n");
        var sensorStream = fs.createReadStream(filePath);
        var sensorReader = lr.createInterface(
        {
            input: sensorStream
        });
        sensorReader.on('line', function (line)
        {
            processLine(line);//Collect [timestamp, value] pairs for currSensor
        });
        sensorReader.on('close', function()
        {
            processSensor(currSensor);//Process the data for the current sensor
            done();//start the next pass
        });
    });
});
function getKeys(line)
{
    if(line.indexOf('[') >= 0 || line.indexOf(']') >= 0)
        return;//skip the wrapper lines of the array
    line = line.replace(/\r$/, '');//discard CR (0x0D)
    line = line.replace(/,$/, ''); //discard trailing comma
    if (line.length > 1)
    {   //ignore empty lines
        var obj = JSON.parse(line);//parse the JSON record
        for(var key in obj)
        {
            if(key != "debug" && key != "UTCTime")
                tmpObj[key] = true;//record the sensor key; no array needed here
        }
    }
}
Of course this doesn't work, and I am not able to find anything on the net that explains how this can be implemented.
Note: I can choose any language of my choice to develop this tool (C/C++, C#, Java, Python), but I am going with JavaScript because of its ability to parse JSON arrays easily (and my interest in getting better at JS as well). Would someone like to suggest an alternative language if JavaScript isn't the best language to make such a tool?
Edit: Some important info which either was not very clear or which I did not include earlier, but looks important to include in the question -
- The data in the JSON logs is not streaming live; it's a stored JSON file on a hard disk
- The data is not stored in chronological order, which means the timestamps might be out of order. So each sensor's data needs to be sorted by timestamp after it has been loaded into an array
- I cannot use separate arrays for each sensor (that would be the same as storing the entire 16 GB JSON in RAM); to save memory, only one array should be used at a time. And yes, there are more than 4 sensors in my log; this is just a sample (roughly 20, to give an idea)
I have modified my JSON and expected output
One solution might be to make multiple passes over the JSON file, storing one sensor's data with timestamps in an array at a time, then sorting the array, and finally analyzing the data for corruption and gaps. And that's what I'm trying to do in my code above.
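For what it's worth, the two helpers referenced in my code above could look roughly like this. This is only a sketch: currSensor and result mirror the globals in my code, and sensorPeriods (with made-up period values) stands in for wherever the known frequencies come from:

```javascript
// Hypothetical sketch of the processLine/processSensor helpers.
var currSensor = "s3";  //set by the outer per-sensor loop before each pass
var result = [];        //the single reusable array
var sensorPeriods = { s1: 1, s2: 1, s3: 2, s4: 1 };//known frequencies (assumed)

function processLine(line)
{
    if(line.indexOf('[') >= 0 || line.indexOf(']') >= 0)
        return;//skip the wrapper lines of the array
    line = line.trim().replace(/,$/, '');//drop CR and trailing comma
    if(line.length < 2) return;//ignore empty lines
    var obj = JSON.parse(line);
    if(obj[currSensor] !== undefined)
        result.push([obj.UTCTime, obj[currSensor]]);
}

function processSensor(key)
{
    result.sort(function(a, b) { return a[0] - b[0]; });//timestamps arrive out of order
    var period = sensorPeriods[key] || 1;
    var missing = 0;
    for(var i = 1; i < result.length; i++)
        missing += Math.round((result[i][0] - result[i - 1][0]) / period) - 1;
    console.log(key + " had " + missing + " missing entries");
    result.length = 0;//empty the array in place for the next pass
    return missing;
}
```

Clearing result with result.length = 0 (rather than reassigning) keeps a single array alive across passes, which is the memory behavior I'm after.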
awk, sed and sort to the rescue!