Speed up Python MongoDB query

Question

I am new to MongoDB and am trying to read data from several collections. I want to run some statistics on GHTorrent, so I am attempting to write a .csv file with the data I'm interested in. The problem is that my query has now been running for some 30 minutes. I'm sure my search is less efficient than it could be, but I'm not sure how to improve it.

First, I do

closed_issues = ghdb.issues.find(
    {"state": "closed"},  # query criteria
    {  # projection
        "id": 1,
        "created_at": 1,
        "closed_at": 1,
        "comments": 1,
        "repo": 1,
        "owner": 1,
        "number": 1,
    },
)

Then, after opening a file and writing the header row, I do

for issue in closed_issues:
    countMentioned = ghdb.issue_events.find({
        "issue_id": issue['number'],
        "repo": issue['repo'],
        "owner": issue['owner'],
        "event": "mentioned"}).count()
    countSubscribed = ghdb.issue_events.find({
        "issue_id": issue['number'],
        "repo": issue['repo'],
        "owner": issue['owner'],
        "event": "subscribed"}).count()
    countAssigned = ghdb.issue_events.find({
        "issue_id": issue['number'],
        "repo": issue['repo'],
        "owner": issue['owner'],
        "event": "assigned"}).count()
    time_created = parser.parse(issue['created_at'])
    time_closed = parser.parse(issue['closed_at'])
    timediff = time_closed - time_created

    f.write(
        str(issue['id']) +","+
        str(issue['number']) +","+
        str(issue['repo']) +","+
        str(issue['owner']) +","+
        str(timediff.total_seconds()) +","+
        str(issue['comments']) +","+
        str(countMentioned) +","+
        str(countSubscribed) +","+
        str(countAssigned) +'\n'
        )

As you can see, each issue triggers three separate finds that share three of their four criteria. What is the most efficient way to search once for a combination of issue_id, repo and owner and count each of the three different events?

John Petrone · Accepted Answer · 2014-04-22 00:10:35Z

The MongoDB aggregation framework is a great tool for queries that produce aggregated stats like counts: http://docs.mongodb.org/manual/core/aggregation/

I'd start there and experiment with it a bit. For this kind of use case you can usually let the aggregation compute the counts and then wrap a bit of additional code around the result to export the data in the format you need.
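As a rough sketch of what that could look like for the data in the question: a single `$group` over `issue_events` can produce every per-issue count in one pass, replacing three `find().count()` calls per issue. The helper names `build_event_count_pipeline` and `index_counts` are made up for illustration, the field names are taken from the question, and the usage lines at the bottom assume PyMongo 3 or later, where `aggregate()` returns a cursor of documents.

```python
def build_event_count_pipeline(events=("mentioned", "subscribed", "assigned")):
    """Build a pipeline that counts events per (owner, repo, issue_id, event)."""
    return [
        # Keep only the event types of interest.
        {"$match": {"event": {"$in": list(events)}}},
        # One output document per issue and event type, with its count.
        {"$group": {
            "_id": {
                "owner": "$owner",
                "repo": "$repo",
                "issue_id": "$issue_id",
                "event": "$event",
            },
            "count": {"$sum": 1},
        }},
    ]


def index_counts(results):
    """Turn aggregation output into {(owner, repo, issue_id): {event: count}}."""
    counts = {}
    for doc in results:
        key = doc["_id"]
        issue_key = (key["owner"], key["repo"], key["issue_id"])
        counts.setdefault(issue_key, {})[key["event"]] = doc["count"]
    return counts


# Hypothetical usage against the question's `ghdb` handle (not run here):
#
# counts = index_counts(
#     ghdb.issue_events.aggregate(build_event_count_pipeline(), allowDiskUse=True)
# )
# for issue in closed_issues:
#     c = counts.get((issue["owner"], issue["repo"], issue["number"]), {})
#     mentioned = c.get("mentioned", 0)
#     subscribed = c.get("subscribed", 0)
#     assigned = c.get("assigned", 0)
```

This keeps all counts in a Python dict, which may be too large for the full GHTorrent dump. In that case the same pipeline could be restricted with extra `$match` criteria (for example one owner or repo at a time). An index on the matched fields of `issue_events` would likely help either approach.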
