
I currently have a dictionary with data being pulled from an API, where I have given each datapoint its own variable (job_id, jobtitle, company, etc.):

output = {
        'ID': job_id, 
        'Title': jobtitle, 
        'Employer' : company, 
        'Employment type' : emptype, 
        'Fulltime' : tid, 
        'Deadline' : deadline, 
        'Link' : webpage
}

that I want to add to my database, easy enough:

db.jobs.insert_one(output)

but this all happens in a for loop that creates roughly 30 unique new documents, with names, titles, links and whatnot. This script will be run more than once, so what I would like is for it to insert "output" as a document only if it doesn't already exist in the database. All of these new documents do have their own unique IDs coming from the job_id variable; am I able to check against that?

2 Answers


Two things to try:

1) Doing a .find() and then writing to the DB only if no document is found for the given job_id takes two calls. Instead, create a unique index on the job_id field; MongoDB will then throw an error if an operation tries to insert a duplicate document. A unique index is a much safer way to avoid duplicates, and it still protects you even if your code logic fails.

2) If you have 30 dicts, there is no need to iterate 30 times and make 30 database calls with insert_one; use insert_many, which takes an array of dicts and writes them to the database in one call.

Note: By default the dicts are written in the order they appear in the array, so if one dict fails because of a duplicate error, insert_many stops at that point without inserting the rest. To overcome this, pass the option ordered=False; that way all dictionaries are inserted except the duplicates.
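A minimal sketch of both points (the collection and field names are assumptions; with a real connection `collection` would be `db.jobs`, and the duplicate failure surfaces as pymongo's BulkWriteError):

```python
# Sketch: unique index + one bulk insert that skips duplicates.
# `collection` is duck-typed so the sketch can run without a server.
try:
    from pymongo.errors import BulkWriteError
except ImportError:            # lets the sketch run without pymongo installed
    class BulkWriteError(Exception):
        pass

def insert_new_jobs(collection, docs):
    """Insert docs in one call; docs whose ID already exists are skipped."""
    # One-time, idempotent setup: reject a second document with the same ID.
    collection.create_index([('ID', 1)], unique=True)
    try:
        # ordered=False: keep inserting the rest even if some are duplicates.
        collection.insert_many(docs, ordered=False)
    except BulkWriteError:
        pass  # only the duplicates were rejected; the new docs went in

# Build the 30-ish dicts first, then make a single database call:
docs = [{'ID': job_id, 'Title': 'jobtitle', 'Employer': 'company'}
        for job_id in range(3)]
# insert_new_jobs(db.jobs, docs)   # with a real pymongo collection
```

Note the index key is `('ID', 1)` (field name plus sort direction), not the value of the field.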


Comments

I've got insert_many working, but I can't seem to get job_id working as a unique index. I'm getting "pymongo.errors.OperationFailure: Unknown index plugin '23964927'" with db.jobs.create_index([('ID', job_id)], unique=True) as the code.
@Derpa: I haven't really worked with pymongo, but why don't you create the unique index on the DB itself? Do you want your code to do that? That's usually only needed when you must make sure an index is available all the time (in case it got deleted by some change); otherwise you can just create the index once on the DB and never check again.
Actually, I'm not sure I understand the unique index anymore. I pull the data from an API (it might be the same next week or it might be slightly different), but every post from that API has a unique ID. Wouldn't it be easier to check against that? Or is that what it does? Anyhow, doing it in code would be nice since I'm going to do some testing and will probably delete the data a couple of times.
@Derpa: Deleting data doesn't delete the indexes on a collection (deleting an index, or dropping the collection/DB, does). So when the same unique ID comes in next week with updated data, do you need to merge the update into the existing document? Or do you just want to not insert/update the new data coming next week (just ignore it)?
Nope, it works great now! How would one merge and update with the new data? You don't have to answer that if you don't want to :) I was actually able to finish my script now, thank you for your help!
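For the merge question raised above, a common pattern (a sketch, not part of the answer) is update_one with $set and upsert=True: fields present in the new data are merged into the existing document, while fields you don't send are left untouched, unlike replace_one, which swaps the whole document:

```python
# Sketch: merge new API data into an existing document matched by ID.
# `collection` is duck-typed so the sketch can run without a server;
# with pymongo it would be db.jobs.
def merge_job(collection, output):
    # $set updates only the listed fields; other fields on the existing
    # document are preserved. upsert=True inserts the document if no
    # match for this ID exists yet.
    collection.update_one({'ID': output['ID']},
                          {'$set': output},
                          upsert=True)

# merge_job(db.jobs, output)   # with a real pymongo collection
```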

EDIT:

replace

db.jobs.insert_one(output)

with

db.jobs.replace_one({'ID': job_id}, output, upsert=True)

ORIGINAL ANSWER with worked example:

Use replace_one() with upsert=True. You can run this multiple times: it will insert if the ID isn't found, or replace the document if it is. It isn't quite what you asked for, since the data is always updated (newer data will overwrite any existing data).

from pymongo import MongoClient


db = MongoClient()['mydatabase']

for i in range(30):
    db.employer.replace_one({'ID': i},
    {
            'ID': i,
            'Title': 'jobtitle',
            'Employer' : 'company',
            'Employment type' : 'emptype',
            'Fulltime' : 'tid',
            'Deadline' : 'deadline',
            'Link' : 'webpage'
    }, upsert=True)

# Should always print 30 regardless of number of times run.
print(db.employer.count_documents({}))

Comments

What the script does is pull some information from an API; at the moment it produces around 30 docs, but it might as well be 50 or 27. I want to be able to run this script over and over but only write data that doesn't already exist in the collection or database, so that I won't end up with duplicates of the same information. The job_id variable pulls a unique ID from the API that belongs with the rest of the information in each document.
Yeah sorry I should have explained. The code I provided was just a sample to show how it could work. All you need to do is replace db.jobs.insert_one(output) with db.jobs.replace_one({'ID': job_id}, output, upsert=True). I've updated the answer.
@BellyBuster: This might or might not work! If you inserted a doc today and have done a couple of updates on it, a duplicate inserted tomorrow would overwrite the entire doc with the new values, or will remove fields if they happen not to be present in the request (especially if the same op runs every other day).
That is correct; if that isn't what the asker wants they will need to find a different method, e.g. the one you posted.
