
I have a large JSON file with this structure:

[
   {
      "sniffer_serial":"7c9ebd939ab8",
      "serial":"086bd7c39c9a",
      "temp":36.33,
      "x":-0.484375,
      "y":-0.0078125,
      "z":-0.859375,
      "rssi":-70,
      "id":-26648,
      "date":"2021-06-02/09:24:06.238"
   },
   {
      "sniffer_serial":"7c9ebd939ab8",
      "serial":"086bd7c39c94",
      "temp":35.08,
      "x":-0.5078125,
      "y":0.0234375,
      "z":-0.84375,
      "rssi":-87,
      "id":-26633,
      "date":"2021-06-02/09:24:06.028"
   },
   {
      "sniffer_serial":"7c9ebd939ab8",
      "serial":"086bd7c39c94",
      "temp":35.08,
      "x":-0.4921875,
      "y":0.0078125,
      "z":-0.8671875,
      "rssi":-87,
      "id":-26633,
      "date":"2021-06-02/09:24:06.153"
   },
   {
      "sniffer_serial":"7c9ebd939ab8",
      "serial":"086bd7c39c94",
      "temp":35.08,
      "x":-0.4765625,
      "y":0.0234375,
      "z":-0.8671875,
      "rssi":-87,
      "id":-26633,
      "date":"2021-06-02/09:24:06.278"
   },
   {
      "sniffer_serial":"7c9ebd939ab8",
      "serial":"086bd7c39d3b",
      "temp":37.19,
      "x":-0.265625,
      "y":-0.0390625,
      "z":-0.9921875,
      "rssi":-86,
      "id":-30714,
      "date":"2021-06-02/09:24:06.058"
   },
   {
      "sniffer_serial":"7c9ebd9448a0",
      "serial":"086bd7c39d3b",
      "temp":37.19,
      "x":-0.21875,
      "y":0.015625,
      "z":-0.9296875,
      "rssi":-86,
      "id":-30714,
      "date":"2021-06-02/09:24:06.183"
   },
   {
      "sniffer_serial":"7c9ebd9448a0",
      "serial":"086bd7c39d3b",
      "temp":37.19,
      "x":-0.203125,
      "y":0.046875,
      "z":-0.9609375,
      "rssi":-86,
      "id":-30714,
      "date":"2021-06-02/09:24:06.308"
   }
]

What I'm trying to do is sort this file first by serial and then by date, and remove any objects that share the same id (even if some other values, like sniffer_serial, differ).

This is what I got so far:

import json
from itertools import groupby

#json filepath
json_file_path = "./myfile.json"

#opening and loading the file content
with open(json_file_path, 'r') as j:
     contents = json.loads(j.read())

data = {} #dict that will contain my sorted data

#sorting data
for key, items in groupby(sorted(contents, key = lambda x: (x['serial'], x['date'])), key=lambda x: x['serial']):
     data[key] = list(items)

#saving it as new file
with open('datasorted.json', 'w') as f:
    f.write(str(data))

What I'm having trouble with is removing the duplicate objects that share the same id. Should I create another dict and iterate over it to check whether an entry with the same id already exists?
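(For reference, the "seen set" idea described above could look like the following minimal sketch; the records list here is a trimmed stand-in for the loaded file, with only the keys that matter for sorting and de-duplication:)

```python
# Trimmed stand-in for the loaded JSON list.
records = [
    {"serial": "086bd7c39c94", "id": -26633, "date": "2021-06-02/09:24:06.028"},
    {"serial": "086bd7c39c94", "id": -26633, "date": "2021-06-02/09:24:06.153"},
    {"serial": "086bd7c39c9a", "id": -26648, "date": "2021-06-02/09:24:06.238"},
]

seen_ids = set()
deduped = []
for record in sorted(records, key=lambda r: (r["serial"], r["date"])):
    if record["id"] not in seen_ids:  # keep only the first record per id
        seen_ids.add(record["id"])
        deduped.append(record)
```

Because the list is sorted before the loop, "first record per id" means the earliest date for that id, and the result stays in (serial, date) order.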

How I expect the final JSON file to look:

[
   {
      "sniffer_serial":"7c9ebd939ab8",
      "serial":"086bd7c39c94",
      "temp":35.08,
      "x":-0.5078125,
      "y":0.0234375,
      "z":-0.84375,
      "rssi":-87,
      "id":-26633,
      "date":"2021-06-02/09:24:06.028"
   },
   {
      "sniffer_serial":"7c9ebd939ab8",
      "serial":"086bd7c39c9a",
      "temp":36.33,
      "x":-0.484375,
      "y":-0.0078125,
      "z":-0.859375,
      "rssi":-70,
      "id":-26648,
      "date":"2021-06-02/09:24:06.238"
   },
   {
      "sniffer_serial":"7c9ebd939ab8",
      "serial":"086bd7c39d3b",
      "temp":37.19,
      "x":-0.265625,
      "y":-0.0390625,
      "z":-0.9921875,
      "rssi":-86,
      "id":-30714,
      "date":"2021-06-02/09:24:06.058"
   },
   {
      "sniffer_serial":"7c9ebd9448a0",
      "serial":"086bd7c39d3b",
      "temp":37.19,
      "x":-0.21875,
      "y":0.015625,
      "z":-0.9296875,
      "rssi":-86,
      "id":-30714,
      "date":"2021-06-02/09:24:06.183"
   }
]

EDIT:

Creating a Pandas dataframe and trying to drop duplicates is raising the following error:

KeyError: Index(['id'], dtype='object')

Code:

dataPandas = pd.DataFrame.from_dict(data,orient='index')

dataPandas.drop_duplicates(subset="id",keep="first")
1 Comment
    In your for key, items loop, items is an iterator that contains all the items in that group. If you only care about one of the items, just set that value: data[key] = list(items)[0]. Note though that your final data will be a dict. If you want it to be a list like it was before, do data = [] and data.append(list(items)[0]) Commented Jun 7, 2021 at 17:54

3 Answers


I see a few issues:

  1. You want a list but you add all the items you care about to a dict. Then you write this dict to your output json.
  2. In your for key, items loop, items is an iterator that contains all the items in that group. If you only care about one of the items (e.g. the first), just set that value like so: data[key] = list(items)[0]

Incorporating these changes, you'd get:

data = [] #list that will contain my sorted, de-duplicated data

#sorting data
for key, items in groupby(sorted(contents, key = lambda x: (x['serial'], x['date'])), key=lambda x: x['id']):
     data.append(next(items))

next(items) gets only the next item of the iterator. On the other hand, list(items)[0] would convert the entire iterator to a list, and then take the first element.
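To see the difference concretely, here is a small standalone illustration (the tuples are placeholder data, not the question's records):

```python
from itertools import groupby

data = [("a", 1), ("a", 2), ("b", 3)]
groups = groupby(data, key=lambda t: t[0])

key, items = next(groups)
first = next(items)  # consumes only one element of the "a" group

# Advancing to the next group silently skips the rest of group "a";
# list(items)[0] would have materialized the whole group first.
key, items = next(groups)
second = list(items)[0]
```

Both spellings pick the first element of a group; next(items) just avoids building the throwaway list.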

This gives us the following data:

print(json.dumps(data, indent=4))

[
    {
        "sniffer_serial": "7c9ebd939ab8",
        "serial": "086bd7c39c94",
        "temp": 35.08,
        "x": -0.5078125,
        "y": 0.0234375,
        "z": -0.84375,
        "rssi": -87,
        "id": -26633,
        "date": "2021-06-02/09:24:06.028"
    },
    {
        "sniffer_serial": "7c9ebd939ab8",
        "serial": "086bd7c39c9a",
        "temp": 36.33,
        "x": -0.484375,
        "y": -0.0078125,
        "z": -0.859375,
        "rssi": -70,
        "id": -26648,
        "date": "2021-06-02/09:24:06.238"
    },
    {
        "sniffer_serial": "7c9ebd939ab8",
        "serial": "086bd7c39d3b",
        "temp": 37.19,
        "x": -0.265625,
        "y": -0.0390625,
        "z": -0.9921875,
        "rssi": -86,
        "id": -30714,
        "date": "2021-06-02/09:24:06.058"
    }
]

One caveat: groupby only merges consecutive items with the same key, so grouping by id only removes all duplicates if records with equal ids sit next to each other at that point (which they happen to in your sample). With that in mind, you could also do the groupby first and then sort on the serial:

unique_contents = [next(v) for k, v in groupby(contents, key=lambda x: x['id'])]
data = sorted(unique_contents, key=lambda x: (x['serial'], x['date']))

Or in one step, pass the same logic as a generator expression directly to sorted:

data = sorted(
    (next(v) for k, v in groupby(contents, key=lambda x: x['id'])), 
    key=lambda x: (x['serial'], x['date'])
)
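Run against a trimmed stand-in for the question's data (duplicate ids adjacent, as groupby requires), the one-liner behaves like this:

```python
from itertools import groupby

# Trimmed stand-in for the loaded file; duplicate ids are adjacent,
# which groupby needs since it only merges consecutive equal keys.
contents = [
    {"serial": "086bd7c39c9a", "id": -26648, "date": "2021-06-02/09:24:06.238"},
    {"serial": "086bd7c39c94", "id": -26633, "date": "2021-06-02/09:24:06.028"},
    {"serial": "086bd7c39c94", "id": -26633, "date": "2021-06-02/09:24:06.153"},
]

# Keep the first record of each id group, then sort by (serial, date).
data = sorted(
    (next(v) for k, v in groupby(contents, key=lambda x: x["id"])),
    key=lambda x: (x["serial"], x["date"]),
)
```
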

Also note: you can read and write json directly from the file:

#opening and loading the file content
with open(input_file_path, 'r') as j:
     contents = json.load(j)

with open(output_file_path, 'w') as j:
    json.dump(data, j, indent=4) # indent=4 for pretty-printing

1 Comment

Thank you so much! I learned a lot from this answer, truly.

Your approach looks solid.

If you don't care about which of the duplicate elements you are using down the line, you can just take the first one:

...
for key, items in groupby(
    sorted(
        contents,
        key=lambda x: (x['serial'], x['date'])
    ),
    key=lambda x: x['serial']
):
    # items is an iterator and if you only care about the first element,
    # you can call next() once on it (instead of converting it to a list),
    # so that it doesn't iterate all entries.
    data[key] = next(items)

# Save as new file with indentation
with open('datasorted.json', 'w') as f:
    json.dump(data, f, indent=4)

The output will look like this:

{
    "086bd7c39c94": {
        "sniffer_serial": "7c9ebd939ab8",
        "serial": "086bd7c39c94",
        "temp": 35.08,
        "x": -0.5078125,
        "y": 0.0234375,
        "z": -0.84375,
        "rssi": -87,
        "id": -26633,
        "date": "2021-06-02/09:24:06.028"
    },
    "086bd7c39c9a": {
        "sniffer_serial": "7c9ebd939ab8",
        "serial": "086bd7c39c9a",
        "temp": 36.33,
        "x": -0.484375,
        "y": -0.0078125,
        "z": -0.859375,
        "rssi": -70,
        "id": -26648,
        "date": "2021-06-02/09:24:06.238"
    },
    "086bd7c39d3b": {
        "sniffer_serial": "7c9ebd939ab8",
        "serial": "086bd7c39d3b",
        "temp": 37.19,
        "x": -0.265625,
        "y": -0.0390625,
        "z": -0.9921875,
        "rssi": -86,
        "id": -30714,
        "date": "2021-06-02/09:24:06.058"
    }
}

5 Comments

I think it is my fault for not making my question clear. This is a great solution, but it's not what I'm looking for, because it only takes the first serial it finds to add to the dict. What I want is to have no duplicate ids while preserving the other objects (even those with the same serial), so I can later plot this data through time. If I change key=lambda x: x['serial'] to key=lambda x: x['id'], it messes with the ordering.
I edited my question with how the JSON should look to further clarify things. Thanks for your help
@LeonardoFelixdaSilva why do you expect "serial": "086bd7c39c94" to be sorted after "serial": "086bd7c39c9a"? That is the wrong sort order "086bd7c39c9a" > "086bd7c39c94" (both lexicographically and if you interpreted it as a hex number)
@PranavHosangadi that happened because I was trying to do 2 things at the same time with not nearly as much attention as I should have given them. I'm sorry for that, I fixed that typo
I see, in that case the accepted answer is more accurate.

Consider using pandas to create a DataFrame from your data with pd.DataFrame.from_dict and then running the de-dupe function (pandas.DataFrame.drop_duplicates) on it.
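A minimal sketch of that suggestion. Note it builds the frame from the flat list of records rather than from the already-grouped dict; the KeyError in the question's edit likely comes from the grouped dict, whose frame ends up with numbered columns instead of an "id" column:

```python
import pandas as pd

# Trimmed stand-in records; in practice this would be the list loaded
# from the JSON file.
records = [
    {"serial": "086bd7c39c94", "id": -26633, "date": "2021-06-02/09:24:06.028"},
    {"serial": "086bd7c39c94", "id": -26633, "date": "2021-06-02/09:24:06.153"},
    {"serial": "086bd7c39c9a", "id": -26648, "date": "2021-06-02/09:24:06.238"},
]

df = pd.DataFrame(records)
df = (
    df.drop_duplicates(subset="id", keep="first")  # returns a new frame
      .sort_values(["serial", "date"])
)
result = df.to_dict(orient="records")
```

Also note that drop_duplicates does not modify the frame in place by default; the result has to be reassigned (or written back with inplace=True).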

1 Comment

Can you please elaborate further? I have never used pandas before, and I got a KeyError when trying to drop duplicates. I edited my question with the code I just tried
