I am new to MongoDB and am trying to read data from several collections. I want to compute some statistics on GHTorrent, so I am writing a .csv file with the data I'm interested in. The problem is that my query has now been running for about 30 minutes. I'm sure my approach is less efficient than it could be, but I'm not sure how to improve it.

First, I do

closed_issues = ghdb.issues.find(
    {"state": "closed"},  # query criteria
    {  # projection
        "id": 1,
        "created_at": 1,
        "closed_at": 1,
        "comments": 1,
        "repo": 1,
        "owner": 1,
        "number": 1,
    },
)

Then, after opening a file and writing headlines, I do

for issue in closed_issues:
    countMentioned = ghdb.issue_events.find({
        "issue_id": issue['number'],
        "repo": issue['repo'],
        "owner": issue['owner'],
        "event": "mentioned"}).count()
    countSubscribed = ghdb.issue_events.find({
        "issue_id": issue['number'],
        "repo": issue['repo'],
        "owner": issue['owner'],
        "event": "subscribed"}).count()
    countAssigned = ghdb.issue_events.find({
        "issue_id": issue['number'],
        "repo": issue['repo'],
        "owner": issue['owner'],
        "event": "assigned"}).count()
    time_created = parser.parse(issue['created_at'])
    time_closed = parser.parse(issue['closed_at'])
    timediff = time_closed - time_created

    f.write(
        str(issue['id']) +","+
        str(issue['number']) +","+
        str(issue['repo']) +","+
        str(issue['owner']) +","+
        str(timediff.total_seconds()) +","+
        str(issue['comments']) +","+
        str(countMentioned) +","+
        str(countSubscribed) +","+
        str(countAssigned) +'\n'
        )

As you can see, three of the four criteria are the same in each of the three finds I run per issue. What is the most efficient way to search once for a combination of issue_id, repo and owner and get a count for each of the three event types?

1 Answer

The MongoDB aggregation framework is a great tool for queries that produce aggregated statistics such as counts: http://docs.mongodb.org/manual/core/aggregation/

I'd start there and experiment with it a bit. For this kind of use case you can usually let the aggregation compute the counts and then wrap a bit of additional code around the result to export the data in the format you need.
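
As a rough illustration of that idea, here is a minimal sketch that replaces the three per-issue counts with a single aggregation over issue_events. It assumes the field names from the question (owner, repo, issue_id, event); the helper names build_event_count_pipeline and counts_by_issue are made up for this example, and I have not run it against GHTorrent.

```python
from collections import defaultdict

# Event types counted in the question.
EVENTS = ["mentioned", "subscribed", "assigned"]


def build_event_count_pipeline(events=EVENTS):
    """Build a pipeline that counts events per (owner, repo, issue_id, event)."""
    return [
        # Keep only the event types of interest.
        {"$match": {"event": {"$in": list(events)}}},
        # One output document per issue and event type, with its count.
        {"$group": {
            "_id": {
                "owner": "$owner",
                "repo": "$repo",
                "issue_id": "$issue_id",
                "event": "$event",
            },
            "count": {"$sum": 1},
        }},
    ]


def counts_by_issue(rows):
    """Pivot aggregation rows into {(owner, repo, issue_id): {event: count}}."""
    counts = defaultdict(lambda: dict.fromkeys(EVENTS, 0))
    for row in rows:
        key = row["_id"]
        issue_key = (key["owner"], key["repo"], key["issue_id"])
        counts[issue_key][key["event"]] = row["count"]
    return counts


# Hypothetical usage with the question's ghdb handle:
#   rows = ghdb.issue_events.aggregate(build_event_count_pipeline())
#   event_counts = counts_by_issue(rows)
#   c = event_counts[(issue['owner'], issue['repo'], issue['number'])]
#   c['mentioned'], c['subscribed'], c['assigned']
```

With this, the loop over closed_issues only does a dictionary lookup per issue instead of three queries. On a collection this large the $group stage may exceed the in-memory limit, in which case passing allowDiskUse=True to aggregate() is probably needed, and the resulting dictionary has to fit in memory on the client side.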
