MongoDB + Python - very slow simple query

Question

I have an open source energy monitor (http://openenergymonitor.org) which logs the power usage of my house every five seconds, so I thought this would be a perfect application to play with MongoDB. I have a Flask Python application running in Apache using MongoEngine to interface with MongoDB.

Now I am running all of this on a RaspberryPi, so I'm not expecting incredible performance, but a simple query is taking around 20 seconds, which seems slow even for this limited hardware.

I have the following model:

class Reading(db.Document):
    created_at = db.DateTimeField(default=datetime.datetime.now, required=True)
    created_at_year = db.IntField(default=datetime.datetime.now().year, required=True)
    created_at_month = db.IntField(default=datetime.datetime.now().month, required=True)
    created_at_day = db.IntField(default=datetime.datetime.now().day, required=True)
    created_at_hour = db.IntField(default=datetime.datetime.now().hour, required=True)
    battery = db.IntField()
    power = db.IntField()
    meta = {
        'indexes': ['created_at_year', 'created_at_month', 'created_at_day', 'created_at_hour']
    }

I currently have around 36,000 readings stored from the last couple of days. The following code runs super quick:

def get_readings_count():
    count = '<p>Count: %d</p>' % Reading.objects.count()
    return count

def get_last_24_readings_as_json():
    readings = Reading.objects.order_by('-id')[:24]
    result = "["
    for reading in reversed(readings):
        result += str(reading.power) + ","
    result = result[:-1]
    result += "]"
    return result

But doing a simple filter:

def get_today_readings_count():
    todaycount = '<p>Today: %d</p>' % Reading.objects(created_at_year=2014, created_at_month=1, created_at_day=28).count()
    return todaycount

Takes around 20 seconds - there are around 11,000 readings for today.

Shall I give up expecting anything more of my Pi, or is there some tuning I can do to get more performance from MongoDB?

Mongo 2.1.1 on Debian Wheezy

Update 29/1/2014:

In response to an answer below, here are the results of getIndexes() and explain():

> db.reading.getIndexes()
[
    {
        "v" : 1,
        "key" : {
            "_id" : 1
        },
        "ns" : "sensor_network.reading",
        "name" : "_id_"
    },
    {
        "v" : 1,
        "key" : {
            "created_at_year" : 1
        },
        "ns" : "sensor_network.reading",
        "name" : "created_at_year_1",
        "background" : false,
        "dropDups" : false
    },
    {
        "v" : 1,
        "key" : {
            "created_at_month" : 1
        },
        "ns" : "sensor_network.reading",
        "name" : "created_at_month_1",
        "background" : false,
        "dropDups" : false
    },
    {
        "v" : 1,
        "key" : {
            "created_at_day" : 1
        },
        "ns" : "sensor_network.reading",
        "name" : "created_at_day_1",
        "background" : false,
        "dropDups" : false
    },
    {
        "v" : 1,
        "key" : {
            "created_at_hour" : 1
        },
        "ns" : "sensor_network.reading",
        "name" : "created_at_hour_1",
        "background" : false,
        "dropDups" : false
    }
]

> db.reading.find({created_at_year: 2014, created_at_month: 1, created_at_day: 28 }).explain()
{
    "cursor" : "BtreeCursor created_at_day_1",
    "isMultiKey" : false,
    "n" : 15689,
    "nscannedObjects" : 15994,
    "nscanned" : 15994,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 5,
    "nChunkSkips" : 0,
    "millis" : 25511,
    "indexBounds" : {
        "created_at_day" : [
            [
                28,
                28
            ]
        ]
    },
    "server" : "raspberrypi:27017"
}

Update 4 Feb

Okay, so I deleted the indexes, set a new one on created_at, deleted all the records and left it a day to collect new data. I've just run a query for today's data and it took longer (48 seconds):

> db.reading.find({'created_at': {'$gte':ISODate("2014-02-04")}}).explain()
{
    "cursor" : "BtreeCursor created_at_1",
    "isMultiKey" : false,
    "n" : 14189,
    "nscannedObjects" : 14189,
    "nscanned" : 14189,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 9,
    "nChunkSkips" : 0,
    "millis" : 48653,
    "indexBounds" : {
        "created_at" : [
            [
                ISODate("2014-02-04T00:00:00Z"),
                ISODate("292278995-12-2147483314T07:12:56.808Z")
            ]
        ]
    },
    "server" : "raspberrypi:27017"
}

That's with only 16,177 records in the database and only one index. There's around 111MB of free memory, so there shouldn't be an issue with the index fitting in memory. I guess I'm going to have to write this off as the Pi not being powerful enough for this job.

xlembouras · Accepted Answer · 2014-01-29 21:48:17Z

1

Are you sure that your index is created? could you provide the output of getIndexes() of your collection

eg: db.my_collection.getIndexes()

and the explanation of your query

db.my_collection.find({created_at_year: 2014, created_at_month: 1, created_at_day: 28 }).explain()

PS: of course I must agree with @Aesthete about the fact that you store much more than you need to...

29/1/2014 update

Perfect! As you see you have four different indexes when you can create ONE compound index which will include all of them.

defining

db.my_collection.ensureIndex({created_at_year: 1, created_at_month: 1, created_at_day: 1, created_at_hour: 1 })

will provide you a more precise index that will enable you to query for:

year
year and month
year and month and day
year and month and day and hour

This will make your queries (with the four keys) much faster, because all your criteria will be met in the index data!

please note that that the order of keys in ensureIndex() is crucial, that order actually defines the above mentioned list of queries!

Also note that if all you need is these 4 fields, than if you specify a correct projection
eg:
db.my_collection.find({created_at_year: 2014, created_at_month: 1, created_at_day: 28}, { created_at_year: 1, created_at_month: 1, created_at_day: 1 })

then only the index will be used, which is the maximum performance!

edited Jan 29, 2014 at 21:48

answered Jan 28, 2014 at 23:41

xlembouras

8,3254 gold badges35 silver badges44 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

littlecharva Over a year ago

Thanks for the response, I've updated my post with the results.

littlecharva Over a year ago

Thanks again - I haven't added the compound index yet, but have re-run the explain query again targeting only the one index: db.reading.find({created_at_day: 28 }).explain() and this still runs to 13 seconds. Is that the best performance I can hope for?

xlembouras Over a year ago

@littlecharva it depends, most probably yes, unless all you want is created_at_day field (or in general the keys that are on the compound index), in that case the query won't need to check the data on the db, it will only look in the index, which is the fastest you can get!

Aesthete · Accepted Answer · 2014-01-28 20:50:11Z

0

probably something to do with you saving the date 5 times save it once (ie keep created_at), then if you want the month, day etc in your view, just convert the created_at value to just display the month, day etc

answered Jan 28, 2014 at 20:50

Aesthete

1332 silver badges8 bronze badges

2 Comments

littlecharva Over a year ago

I'm breaking the datetime down into its component parts because I plan on aggregating the data in various ways with mapreduce, so rather than have to extract the day or hour in each map function, it's already there for me to use.

Aesthete Over a year ago

try and use just created_at, and write functions to get the date,month etc from just that value, doing it that way may boost performace, you could time the two approaches and see which one is better, you may stumble upon an optimal balance . I should also add, doing it the way ive mentioned would require less database hits than the way you've done it

Mzzl · Accepted Answer · 2014-01-29 21:56:27Z

0

I wonder if the indexes don't fit in your raspberry pi's memory. Since MongoDB can only use one index per query, and it seems to use only the created_by_day query, you could try dropping the indexes and replacing them with an index on the created_at timestamp. Then you could reduce the size of your documents by getting rid of the created_at_* fields.

You can easily extract the day, month, year etc from an ISO date in a map reduce function, or with the aggregation framework date operators.

The query for today then becomes something like this:

db.reading.find({'created_at':{'$gte':ISODate("2014-01-29"), '$lt':ISODate("2014-01-30")}})

I think it's interesting that you chose a database advertised as suitable for BIG data to run on your embedded device. I'm curious how it will work out. I have a similar gadget, and used BerkeleyDB for storing the readings. Don't forget that MongoDB on a 32 bit OS has a maximum size of 2GB for the entire database.

edited Jan 29, 2014 at 21:56

answered Jan 29, 2014 at 21:50

Mzzl

4,16630 silver badges40 bronze badges

1 Comment

littlecharva Over a year ago

See my update to the original post for the results of using one index. I chose to use a BIG DATA database as I wanted to play with it and it felt like having a sensor reading every 5 seconds and using a low powered device was almost like a mini-Big Data project. I'll look into BerkeleyDB, thanks.

Collectives™ on Stack Overflow

MongoDB + Python - very slow simple query

3 Answers 3

3 Comments

2 Comments

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

3 Comments

2 Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related