
First time here - please go easy… ;)

I'm starting off with MongoDB for the first time, using the official PHP driver to interact with an application. Here's the first problem I've run into with the aggregation framework. I have a collection of documents, all of which contain an array of numbers, like in the following shortened example...

{
  "_id": ObjectId("51c42c1218ef9de420000002"),
  "my_id": 1,
  "numbers": [
    482,
    49,
    382,
    290,
    31,
    126,
    997,
    20,
    145
  ],

}

{
  "_id": ObjectId("51c42c1218ef9de420000006"),
  "my_id": 2,
  "numbers": [
    19,
    234,
    28,
    962,
    24,
    12,
    8,
    643,
    145
  ],

}

{
  "_id": ObjectId("51c42c1218ef9de420000008"),
  "my_id": 3,
  "numbers": [
    912,
    18,
    456,
    34,
    284,
    556,
    95,
    125,
    579
  ],

}

{
  "_id": ObjectId("51c42c1218ef9de420000012"),
  "my_id": 4,
  "numbers": [
    12,
    97,
    227,
    872,
    103,
    78,
    16,
    377,
    20
  ],

}

{
  "_id": ObjectId("51c42c1218ef9de420000016"),
  "my_id": 5,
  "numbers": [
    212,
    237,
    103,
    93,
    55,
    183,
    193,
    17,
    346
  ],

}

Using the aggregation framework and PHP (which I think is the correct way), I'm trying to work out the average number of consecutive documents in which a number doesn't appear (within the numbers array) before it appears again. For example, the average for the number 20 in the above example is 1.5 (there's a gap of 2 documents, followed by a gap of 1; add these values together and divide by the number of gaps). I can get as far as working out whether the number 20 is within the numbers array and then, using the $cond operator, passing a value based on the result. Here's my PHP…

$unwind_results = array(
    '$unwind' => '$numbers'
);

$project = array(
    '$project' => array(
        'my_id' => '$my_id',
        'numbers' => '$numbers',
        // Note the inverted flag: 'hit' is 0 when the unwound number is 20, otherwise 1
        'hit' => array('$cond' => array(
            array('$eq' => array('$numbers', 20)),
            0,
            1
        ))
    )
);

$group = array(
    '$group' => array(
        '_id' => '$my_id',
        // $min yields 0 if any number in the document is 20, otherwise 1
        'hit' => array('$min' => '$hit'),
    )
);

$sort = array(
    '$sort' => array( '_id' => 1 ),
);


$avg = $c->aggregate(array($unwind_results,$project, $group,  $sort));

What I was trying to achieve was to set up some kind of incremental counter that resets every time the number 20 appears in the numbers array, and then grab all of those counts and work out the average from there… But I'm truly stumped.

I know I could work out the average from a collection of documents on the application side, but ideally I’d like Mongo to give me the result I want so it’s more portable.

Would Map/Reduce need to get involved somewhere?

Any help/advice/pointers gratefully received!

  • you can't do it in the current version. Commented Jun 26, 2013 at 0:25

1 Answer

As Asya said, the aggregation framework isn't usable for the last part of your problem (averaging gaps in "hits" between documents in the pipeline). Map/reduce also doesn't seem well-suited to this task, since you need to process the documents serially (and in a sorted order) for this computation and MR emphasizes parallel processing.

Given that the aggregation framework does process documents in a sorted order, I was brainstorming yesterday about how it might support your use case. If $group exposed access to its accumulator values during the projection (in addition to the document being processed), we might be able to use $push to collect previous values in a projected array and then inspect them during a projection to compute these "hit" gaps. Alternatively, if there was some facility to access the previous document encountered by a $group for our bucket (i.e. group key), this could allow us to determine diffs and compute the gap span as well.

I shared those thoughts with Mathias, who works on the framework, and he explained that while all of this might be possible for a single server (were the functionality implemented), it would not work at all on a sharded infrastructure, where $group and $sort operations are distributed. It would not be a portable solution.

I think your best option is to run the aggregation with the $project you have, and then process those results in your application language.
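As a rough illustration of that application-side step, here is a minimal sketch. The helper `averageGap()` is a hypothetical name, not part of the MongoDB driver. It assumes the rows are sorted by `_id` and follow the question's convention (`hit` is 0 when the number is present and 1 when it is absent). It also assumes that every run of consecutive "missing" documents counts as one gap, including a run at the start or end of the result set, which matches the 1.5 worked out in the question. With the legacy PHP driver the rows should be available under `$avg['result']`, but check that against your driver version.

```php
<?php
// Hypothetical helper: averages the lengths of runs of consecutive
// documents in which the number is absent.
// Assumes $rows is sorted by _id; 'hit' => 0 means present, 1 means absent.
function averageGap(array $rows)
{
    $gaps = array();
    $current = 0; // length of the run of misses currently being counted

    foreach ($rows as $row) {
        if ($row['hit'] == 0) {
            // The number is present: close any open run of misses.
            if ($current > 0) {
                $gaps[] = $current;
            }
            $current = 0;
        } else {
            $current++;
        }
    }

    // Assumption: a trailing run of misses also counts as a gap.
    if ($current > 0) {
        $gaps[] = $current;
    }

    // No gaps at all means there is nothing to average.
    return count($gaps) > 0 ? array_sum($gaps) / count($gaps) : null;
}

// Rows shaped like the pipeline output for the five sample documents
// (20 appears in documents 1 and 4 only).
$rows = array(
    array('_id' => 1, 'hit' => 0),
    array('_id' => 2, 'hit' => 1),
    array('_id' => 3, 'hit' => 1),
    array('_id' => 4, 'hit' => 0),
    array('_id' => 5, 'hit' => 1),
);

echo averageGap($rows), "\n"; // 1.5
```

If a leading or trailing run should not count as a gap in your case, drop the corresponding runs before averaging; the loop itself stays the same.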


1 Comment

jmikola, yep, I was thinking along your lines - the problem, as you noted, was getting the previous values. I've ended up working out the average on the application side anyhow - I'd also be implementing this on a sharded infrastructure. Thanks for giving it some thought!
