
I currently have a dictionary with data being pulled from an API, where I have given each datapoint its own variable (job_id, jobtitle, company, etc.):

output = {
        'ID': job_id, 
        'Title': jobtitle, 
        'Employer' : company, 
        'Employment type' : emptype, 
        'Fulltime' : tid, 
        'Deadline' : deadline, 
        'Link' : webpage
}

that I want to add to my database, easy enough:

db.jobs.insert_one(output)

but this all happens in a for loop that creates roughly 30 unique new documents, with names, titles, links and whatnot. This script will be run more than once, so what I would like is for it to insert "output" as a document only if it doesn't already exist in the database. All of these new documents do have their own unique IDs coming from the job_id variable; am I able to check against that?

2 Answers


Two things to try:

1) Doing a .find() and then writing to the DB only if no document is found for the given job_id takes two calls. Instead, create a unique index on the job_id field; MongoDB will then throw an error if an operation tries to insert a duplicate document. A unique index is a much safer way to avoid duplicates, and it still protects you even if your code logic fails.

2) If you have 30 dicts, there is no need to iterate 30 times and make 30 database calls with insert_one; use insert_many, which takes an array of dicts and writes them to the database in one call.

Note: By default the dicts are written in the order they appear in the array, so if one dict fails because of a duplicate error, insert_many stops at that point without inserting the rest. To overcome this, pass the option ordered=False; that way all dictionaries are inserted except the duplicates.
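A minimal sketch of both points (the collection and field names are assumptions; with a real connection `collection` would be `db.jobs`, and the duplicate failure surfaces as pymongo's BulkWriteError):

```python
# Sketch: unique index + one bulk insert that skips duplicates.
# `collection` is duck-typed so the sketch can run without a server.
try:
    from pymongo.errors import BulkWriteError
except ImportError:            # lets the sketch run without pymongo installed
    class BulkWriteError(Exception):
        pass

def insert_new_jobs(collection, docs):
    """Insert docs in one call; docs whose ID already exists are skipped."""
    # One-time, idempotent setup: reject a second document with the same ID.
    collection.create_index([('ID', 1)], unique=True)
    try:
        # ordered=False: keep inserting the rest even if some are duplicates.
        collection.insert_many(docs, ordered=False)
    except BulkWriteError:
        pass  # only the duplicates were rejected; the new docs went in

# Build the 30-ish dicts first, then make a single database call:
docs = [{'ID': job_id, 'Title': 'jobtitle', 'Employer': 'company'}
        for job_id in range(3)]
# insert_new_jobs(db.jobs, docs)   # with a real pymongo collection
```

Note the index key is `('ID', 1)` (field name plus sort direction), not the value of the field.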


Comments

I've got insert_many working, but I can't seem to get job_id working as a unique index. I'm getting "pymongo.errors.OperationFailure: Unknown index plugin '23964927'" with db.jobs.create_index([('ID', job_id)], unique=True) as the code.
@Derpa: I haven't really worked with pymongo, but why don't you create the unique index on the DB itself? Do you want your code to do that? That's usually only needed when you must make sure an index is available all the time (in case it got deleted by some change); otherwise you can just create the index once on the DB and never check again.
Actually, I'm not sure I understand the unique index anymore. I pull the data from an API (it might be the same next week or it might be slightly different), but every post from that API has a unique ID. Wouldn't it be easier to check against that? Or is that what it does? Anyhow, doing it in code would be nice since I'm going to do some testing and will probably delete the data a couple of times.
@Derpa: Deleting data doesn't delete the indexes on a collection (deleting an index, or dropping the collection/DB, does). So when the same unique ID comes in next week with updated data, do you need to merge the update into the existing document? Or do you just want to not insert/update the new data coming next week (just ignore it)?
Nope, it works great now! How would one merge and update with the new data? You don't have to answer that if you don't want to :) I was actually able to finish my script now, thank you for your help!
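For the merge question raised above, a common pattern (a sketch, not part of the answer) is update_one with $set and upsert=True: fields present in the new data are merged into the existing document, while fields you don't send are left untouched, unlike replace_one, which swaps the whole document:

```python
# Sketch: merge new API data into an existing document matched by ID.
# `collection` is duck-typed so the sketch can run without a server;
# with pymongo it would be db.jobs.
def merge_job(collection, output):
    # $set updates only the listed fields; other fields on the existing
    # document are preserved. upsert=True inserts the document if no
    # match for this ID exists yet.
    collection.update_one({'ID': output['ID']},
                          {'$set': output},
                          upsert=True)

# merge_job(db.jobs, output)   # with a real pymongo collection
```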

EDIT:

replace

db.jobs.insert_one(output)

with

db.jobs.replace_one({'ID': job_id}, output, upsert=True)

ORIGINAL ANSWER with worked example:

Use replace_one() with upsert=True. You can run this multiple times: it will insert if the ID isn't found, or replace the document if it is. It isn't quite what you asked for, since the data is always updated (newer data will overwrite any existing data).

from pymongo import MongoClient


db = MongoClient()['mydatabase']

for i in range(30):
    db.employer.replace_one({'ID': i},
    {
            'ID': i,
            'Title': 'jobtitle',
            'Employer' : 'company',
            'Employment type' : 'emptype',
            'Fulltime' : 'tid',
            'Deadline' : 'deadline',
            'Link' : 'webpage'
    }, upsert=True)

# Should always print 30 regardless of number of times run.
print(db.employer.count_documents({}))

Comments

What the script does is pull some information from an API; at the moment it produces around 30 docs, but it might as well be 50 or 27. I want to be able to run this script over and over but only write data that doesn't already exist in the collection or database, so that I won't end up with duplicates of the same information. The job_id variable pulls a unique ID from the API that belongs with the rest of the information in each document.
Yeah sorry I should have explained. The code I provided was just a sample to show how it could work. All you need to do is replace db.jobs.insert_one(output) with db.jobs.replace_one({'ID': job_id}, output, upsert=True). I've updated the answer.
@BellyBuster: This might or might not work! If you inserted a doc today and have done a couple of updates on it, a duplicate inserted tomorrow would overwrite the entire doc with the new values, or will remove fields if they happen not to be present in the request (especially if the same op runs every other day).
That is correct; if that isn't what the asker wants they will need to find a different method, e.g. the one you posted.
