
I have a large JSON file with this structure:

[
   {
      "sniffer_serial":"7c9ebd939ab8",
      "serial":"086bd7c39c9a",
      "temp":36.33,
      "x":-0.484375,
      "y":-0.0078125,
      "z":-0.859375,
      "rssi":-70,
      "id":-26648,
      "date":"2021-06-02/09:24:06.238"
   },
   {
      "sniffer_serial":"7c9ebd939ab8",
      "serial":"086bd7c39c94",
      "temp":35.08,
      "x":-0.5078125,
      "y":0.0234375,
      "z":-0.84375,
      "rssi":-87,
      "id":-26633,
      "date":"2021-06-02/09:24:06.028"
   },
   {
      "sniffer_serial":"7c9ebd939ab8",
      "serial":"086bd7c39c94",
      "temp":35.08,
      "x":-0.4921875,
      "y":0.0078125,
      "z":-0.8671875,
      "rssi":-87,
      "id":-26633,
      "date":"2021-06-02/09:24:06.153"
   },
   {
      "sniffer_serial":"7c9ebd939ab8",
      "serial":"086bd7c39c94",
      "temp":35.08,
      "x":-0.4765625,
      "y":0.0234375,
      "z":-0.8671875,
      "rssi":-87,
      "id":-26633,
      "date":"2021-06-02/09:24:06.278"
   },
   {
      "sniffer_serial":"7c9ebd939ab8",
      "serial":"086bd7c39d3b",
      "temp":37.19,
      "x":-0.265625,
      "y":-0.0390625,
      "z":-0.9921875,
      "rssi":-86,
      "id":-30714,
      "date":"2021-06-02/09:24:06.058"
   },
   {
      "sniffer_serial":"7c9ebd9448a0",
      "serial":"086bd7c39d3b",
      "temp":37.19,
      "x":-0.21875,
      "y":0.015625,
      "z":-0.9296875,
      "rssi":-86,
      "id":-30714,
      "date":"2021-06-02/09:24:06.183"
   },
   {
      "sniffer_serial":"7c9ebd9448a0",
      "serial":"086bd7c39d3b",
      "temp":37.19,
      "x":-0.203125,
      "y":0.046875,
      "z":-0.9609375,
      "rssi":-86,
      "id":-30714,
      "date":"2021-06-02/09:24:06.308"
   }
]

What I'm trying to do is sort this file first by serial and then by date, and remove any objects that share the same id (even if some other values, like sniffer_serial, differ).

This is what I got so far:

import json
from itertools import groupby

#json filepath
json_file_path = "./myfile.json"

#opening and loading the file content
with open(json_file_path, 'r') as j:
     contents = json.loads(j.read())

data = {} #dict that will contain my sorted data

#sorting data
for key, items in groupby(sorted(contents, key = lambda x: (x['serial'], x['date'])), key=lambda x: x['serial']):
     data[key] = list(items)

#saving it as new file
with open('datasorted.json', 'w') as f:
    f.write(str(data))

What I'm having trouble with is removing the duplicate objects that share the same id. Should I create another dict and iterate over it to check whether an entry with the same id already exists?
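(For reference, the "seen set" idea described above could look like the following minimal sketch; the records list here is a trimmed stand-in for the loaded file, with only the keys that matter for sorting and de-duplication:)

```python
# Trimmed stand-in for the loaded JSON list.
records = [
    {"serial": "086bd7c39c94", "id": -26633, "date": "2021-06-02/09:24:06.028"},
    {"serial": "086bd7c39c94", "id": -26633, "date": "2021-06-02/09:24:06.153"},
    {"serial": "086bd7c39c9a", "id": -26648, "date": "2021-06-02/09:24:06.238"},
]

seen_ids = set()
deduped = []
for record in sorted(records, key=lambda r: (r["serial"], r["date"])):
    if record["id"] not in seen_ids:  # keep only the first record per id
        seen_ids.add(record["id"])
        deduped.append(record)
```

Because the list is sorted before the loop, "first record per id" means the earliest date for that id, and the result stays in (serial, date) order.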

How I expect the final JSON file to look:

[
   {
      "sniffer_serial":"7c9ebd939ab8",
      "serial":"086bd7c39c94",
      "temp":35.08,
      "x":-0.5078125,
      "y":0.0234375,
      "z":-0.84375,
      "rssi":-87,
      "id":-26633,
      "date":"2021-06-02/09:24:06.028"
   },
   {
      "sniffer_serial":"7c9ebd939ab8",
      "serial":"086bd7c39c9a",
      "temp":36.33,
      "x":-0.484375,
      "y":-0.0078125,
      "z":-0.859375,
      "rssi":-70,
      "id":-26648,
      "date":"2021-06-02/09:24:06.238"
   },
   {
      "sniffer_serial":"7c9ebd939ab8",
      "serial":"086bd7c39d3b",
      "temp":37.19,
      "x":-0.265625,
      "y":-0.0390625,
      "z":-0.9921875,
      "rssi":-86,
      "id":-30714,
      "date":"2021-06-02/09:24:06.058"
   },
   {
      "sniffer_serial":"7c9ebd9448a0",
      "serial":"086bd7c39d3b",
      "temp":37.19,
      "x":-0.21875,
      "y":0.015625,
      "z":-0.9296875,
      "rssi":-86,
      "id":-30714,
      "date":"2021-06-02/09:24:06.183"
   }
]

EDIT:

Creating a Pandas dataframe and trying to drop duplicates is raising the following error:

KeyError: Index(['id'], dtype='object')

Code:

dataPandas = pd.DataFrame.from_dict(data,orient='index')

dataPandas.drop_duplicates(subset="id",keep="first")
1 Comment
    In your for key, items loop, items is an iterator that contains all the items in that group. If you only care about one of the items, just set that value: data[key] = list(items)[0]. Note though that your final data will be a dict. If you want it to be a list like it was before, do data = [] and data.append(list(items)[0]) Commented Jun 7, 2021 at 17:54

3 Answers


I see a few issues:

  1. You want a list but you add all the items you care about to a dict. Then you write this dict to your output json.
  2. In your for key, items loop, items is an iterator that contains all the items in that group. If you only care about one of the items (e.g. the first), just set that value like so: data[key] = list(items)[0]

Incorporating these changes, you'd get:

data = [] #list that will contain my sorted, de-duplicated data

#sorting data
for key, items in groupby(sorted(contents, key = lambda x: (x['serial'], x['date'])), key=lambda x: x['id']):
     data.append(next(items))

next(items) gets only the next item of the iterator. On the other hand, list(items)[0] would convert the entire iterator to a list, and then take the first element.
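To see the difference concretely, here is a small standalone illustration (the tuples are placeholder data, not the question's records):

```python
from itertools import groupby

data = [("a", 1), ("a", 2), ("b", 3)]
groups = groupby(data, key=lambda t: t[0])

key, items = next(groups)
first = next(items)  # consumes only one element of the "a" group

# Advancing to the next group silently skips the rest of group "a";
# list(items)[0] would have materialized the whole group first.
key, items = next(groups)
second = list(items)[0]
```

Both spellings pick the first element of a group; next(items) just avoids building the throwaway list.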

This gives us the following data:

print(json.dumps(data, indent=4))

[
    {
        "sniffer_serial": "7c9ebd939ab8",
        "serial": "086bd7c39c94",
        "temp": 35.08,
        "x": -0.5078125,
        "y": 0.0234375,
        "z": -0.84375,
        "rssi": -87,
        "id": -26633,
        "date": "2021-06-02/09:24:06.028"
    },
    {
        "sniffer_serial": "7c9ebd939ab8",
        "serial": "086bd7c39c9a",
        "temp": 36.33,
        "x": -0.484375,
        "y": -0.0078125,
        "z": -0.859375,
        "rssi": -70,
        "id": -26648,
        "date": "2021-06-02/09:24:06.238"
    },
    {
        "sniffer_serial": "7c9ebd939ab8",
        "serial": "086bd7c39d3b",
        "temp": 37.19,
        "x": -0.265625,
        "y": -0.0390625,
        "z": -0.9921875,
        "rssi": -86,
        "id": -30714,
        "date": "2021-06-02/09:24:06.058"
    }
]

One caveat: groupby only merges consecutive items with the same key, so grouping by id only removes all duplicates if records with equal ids sit next to each other at that point (which they happen to in your sample). With that in mind, you could also do the groupby first and then sort on the serial:

unique_contents = [next(v) for k, v in groupby(contents, key=lambda x: x['id'])]
data = sorted(unique_contents, key=lambda x: (x['serial'], x['date']))

Or in one step, pass the same logic as a generator expression directly to sorted:

data = sorted(
    (next(v) for k, v in groupby(contents, key=lambda x: x['id'])), 
    key=lambda x: (x['serial'], x['date'])
)
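Run against a trimmed stand-in for the question's data (duplicate ids adjacent, as groupby requires), the one-liner behaves like this:

```python
from itertools import groupby

# Trimmed stand-in for the loaded file; duplicate ids are adjacent,
# which groupby needs since it only merges consecutive equal keys.
contents = [
    {"serial": "086bd7c39c9a", "id": -26648, "date": "2021-06-02/09:24:06.238"},
    {"serial": "086bd7c39c94", "id": -26633, "date": "2021-06-02/09:24:06.028"},
    {"serial": "086bd7c39c94", "id": -26633, "date": "2021-06-02/09:24:06.153"},
]

# Keep the first record of each id group, then sort by (serial, date).
data = sorted(
    (next(v) for k, v in groupby(contents, key=lambda x: x["id"])),
    key=lambda x: (x["serial"], x["date"]),
)
```
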

Also note: you can read and write json directly from the file:

#opening and loading the file content
with open(input_file_path, 'r') as j:
     contents = json.load(j)

with open(output_file_path, 'w') as j:
    json.dump(data, j, indent=4) # indent=4 for pretty-printing

1 Comment

Thank you so much! I learned a lot from this answer, truly.

Your approach looks solid.

If you don't care about which of the duplicate elements you are using down the line, you can just take the first one:

...
for key, items in groupby(
    sorted(
        contents,
        key=lambda x: (x['serial'], x['date'])
    ),
    key=lambda x: x['serial']
):
    # items is an iterator and if you only care about the first element,
    # you can call next() once on it (instead of converting it to a list),
    # so that it doesn't iterate all entries.
    data[key] = next(items)

# Save as new file with indentation
with open('datasorted.json', 'w') as f:
    json.dump(data, f, indent=4)

The output will look like this:

{
    "086bd7c39c94": {
        "sniffer_serial": "7c9ebd939ab8",
        "serial": "086bd7c39c94",
        "temp": 35.08,
        "x": -0.5078125,
        "y": 0.0234375,
        "z": -0.84375,
        "rssi": -87,
        "id": -26633,
        "date": "2021-06-02/09:24:06.028"
    },
    "086bd7c39c9a": {
        "sniffer_serial": "7c9ebd939ab8",
        "serial": "086bd7c39c9a",
        "temp": 36.33,
        "x": -0.484375,
        "y": -0.0078125,
        "z": -0.859375,
        "rssi": -70,
        "id": -26648,
        "date": "2021-06-02/09:24:06.238"
    },
    "086bd7c39d3b": {
        "sniffer_serial": "7c9ebd939ab8",
        "serial": "086bd7c39d3b",
        "temp": 37.19,
        "x": -0.265625,
        "y": -0.0390625,
        "z": -0.9921875,
        "rssi": -86,
        "id": -30714,
        "date": "2021-06-02/09:24:06.058"
    }
}

5 Comments

I think it is my fault for not making my question clear. This is a great solution, but it's not what I'm looking for, because it only takes the first serial it finds to add to the dict. What I want is to have no duplicate ids while preserving the other objects (even those with the same serial), so I can later plot this data through time. If I change key=lambda x: x['serial'] to key=lambda x: x['id'], it messes with the ordering.
I edited my question with how the JSON should look to further clarify things. Thanks for your help
@LeonardoFelixdaSilva why do you expect "serial": "086bd7c39c94" to be sorted after "serial": "086bd7c39c9a"? That is the wrong sort order "086bd7c39c9a" > "086bd7c39c94" (both lexicographically and if you interpreted it as a hex number)
@PranavHosangadi that happened because I was trying to do 2 things at the same time with not nearly as much attention as I should have given them. I'm sorry for that, I fixed that typo
I see, in that case the accepted answer is more accurate.

Consider using pandas to create a DataFrame from your data with pd.DataFrame.from_dict and then running the de-dupe function (pandas.DataFrame.drop_duplicates) on it.
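A minimal sketch of that suggestion. Note it builds the frame from the flat list of records rather than from the already-grouped dict; the KeyError in the question's edit likely comes from the grouped dict, whose frame ends up with numbered columns instead of an "id" column:

```python
import pandas as pd

# Trimmed stand-in records; in practice this would be the list loaded
# from the JSON file.
records = [
    {"serial": "086bd7c39c94", "id": -26633, "date": "2021-06-02/09:24:06.028"},
    {"serial": "086bd7c39c94", "id": -26633, "date": "2021-06-02/09:24:06.153"},
    {"serial": "086bd7c39c9a", "id": -26648, "date": "2021-06-02/09:24:06.238"},
]

df = pd.DataFrame(records)
df = (
    df.drop_duplicates(subset="id", keep="first")  # returns a new frame
      .sort_values(["serial", "date"])
)
result = df.to_dict(orient="records")
```

Also note that drop_duplicates does not modify the frame in place by default; the result has to be reassigned (or written back with inplace=True).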

1 Comment

Can you please elaborate further? I have never used pandas before, and I got a KeyError when trying to drop duplicates. I edited my question with the code I just tried
