I am new to MongoDB and am trying to read data from several collections. I want to compute some statistics on GHTorrent, so I am writing the data I'm interested in to a .csv file. The problem is that my query has now been running for about 30 minutes. I'm sure my approach is less efficient than it could be, but I'm not sure how to improve it.
First, I do
closed_issues = ghdb.issues.find(
    {"state": "closed"},  # query criteria
    {  # projection
        "id": 1,
        "created_at": 1,
        "closed_at": 1,
        "comments": 1,
        "repo": 1,
        "owner": 1,
        "number": 1,
    },
)
Then, after opening a file and writing the header row, I do
for issue in closed_issues:
    countMentioned = ghdb.issue_events.find({
        "issue_id": issue['number'],
        "repo": issue['repo'],
        "owner": issue['owner'],
        "event": "mentioned"}).count()
    countSubscribed = ghdb.issue_events.find({
        "issue_id": issue['number'],
        "repo": issue['repo'],
        "owner": issue['owner'],
        "event": "subscribed"}).count()
    countAssigned = ghdb.issue_events.find({
        "issue_id": issue['number'],
        "repo": issue['repo'],
        "owner": issue['owner'],
        "event": "assigned"}).count()
    time_created = parser.parse(issue['created_at'])
    time_closed = parser.parse(issue['closed_at'])
    timediff = time_closed - time_created
    f.write(
        str(issue['id']) + "," +
        str(issue['number']) + "," +
        str(issue['repo']) + "," +
        str(issue['owner']) + "," +
        str(timediff.total_seconds()) + "," +
        str(issue['comments']) + "," +
        str(countMentioned) + "," +
        str(countSubscribed) + "," +
        str(countAssigned) + '\n'
    )
As you can see, each issue triggers three separate finds that share three of their four criteria (issue_id, repo and owner) and differ only in event. What is the most efficient way to search once per combination of issue_id, repo and owner and get a count for each of the three event types?
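To make the question concrete, this is roughly the shape I imagine a single-pass version would take. It is only a sketch that I have not run against GHTorrent: the helper names build_pipeline, tally and load_event_counts are hypothetical, and I am assuming that grouping issue_events by owner, repo, issue_id and event in one aggregation is valid and that the resulting dictionary fits in memory.

```python
from collections import defaultdict

EVENTS = ("mentioned", "subscribed", "assigned")


def build_pipeline(events=EVENTS):
    """Build a pipeline yielding one count per (owner, repo, issue_id, event)."""
    return [
        # Keep only the event types of interest.
        {"$match": {"event": {"$in": list(events)}}},
        # Count documents for every issue/event combination.
        {"$group": {
            "_id": {
                "owner": "$owner",
                "repo": "$repo",
                "issue_id": "$issue_id",
                "event": "$event",
            },
            "count": {"$sum": 1},
        }},
    ]


def tally(rows):
    """Fold grouped rows into {(owner, repo, issue_id): {event: count}}."""
    counts = defaultdict(lambda: dict.fromkeys(EVENTS, 0))
    for row in rows:
        key = row["_id"]
        issue_key = (key["owner"], key["repo"], key["issue_id"])
        counts[issue_key][key["event"]] = row["count"]
    return counts


def load_event_counts(issue_events):
    """Run the pipeline once on a PyMongo collection such as ghdb.issue_events."""
    # allowDiskUse is an assumption: the $group stage may exceed the memory limit.
    return tally(issue_events.aggregate(build_pipeline(), allowDiskUse=True))
```

Inside the loop, the three finds would then become a single lookup such as counts[(issue['owner'], issue['repo'], issue['number'])]['mentioned'], with counts = load_event_counts(ghdb.issue_events) computed once before the loop. Issues with no matching events would get 0 for all three counts.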