Aggregation Examples
There are several methods of performing aggregations in MongoDB. These examples cover the new aggregation framework, map reduce, and the group method.
Setup
To start, we’ll insert some example data which we can perform aggregations on:
- >>> from pymongo import MongoClient
- >>> db = MongoClient().aggregation_example
- >>> result = db.things.insert_many([{"x": 1, "tags": ["dog", "cat"]},
- ...                                 {"x": 2, "tags": ["cat"]},
- ...                                 {"x": 2, "tags": ["mouse", "cat", "dog"]},
- ...                                 {"x": 3, "tags": []}])
- >>> result.inserted_ids
- [ObjectId('...'), ObjectId('...'), ObjectId('...'), ObjectId('...')]
Aggregation Framework
This example shows how to use the aggregate() method to run the aggregation framework. We’ll perform a simple aggregation to count the number of occurrences for each tag in the tags array, across the entire collection. To achieve this we need to pass three operations to the pipeline. First, we unwind the tags array, then we group by tag and sum the occurrences, and finally we sort by count.
Python dictionaries only guarantee insertion order from Python 3.7 onward, so you should use SON or collections.OrderedDict where explicit ordering is required, e.g. “$sort”:
Note
aggregate requires server version >= 2.1.0.
- >>> from bson.son import SON
- >>> pipeline = [
- ...     {"$unwind": "$tags"},
- ...     {"$group": {"_id": "$tags", "count": {"$sum": 1}}},
- ...     {"$sort": SON([("count", -1), ("_id", -1)])}
- ... ]
- >>> import pprint
- >>> pprint.pprint(list(db.things.aggregate(pipeline)))
- [{u'_id': u'cat', u'count': 3},
- {u'_id': u'dog', u'count': 2},
- {u'_id': u'mouse', u'count': 1}]
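To make the three stages concrete, here is a rough pure-Python sketch of what the pipeline computes on the sample documents. It runs without a server and only approximates the behaviour of $unwind, $group and $sort; the variable names are illustrative and not part of PyMongo.

```python
from collections import Counter

# The same documents inserted in the Setup section (without _id fields).
docs = [
    {"x": 1, "tags": ["dog", "cat"]},
    {"x": 2, "tags": ["cat"]},
    {"x": 2, "tags": ["mouse", "cat", "dog"]},
    {"x": 3, "tags": []},
]

# $unwind: one item per array element; an empty array contributes nothing.
unwound = [tag for doc in docs for tag in doc["tags"]]

# $group with {"$sum": 1}: count how often each tag occurs.
counts = Counter(unwound)

# $sort on count, then _id, both descending.
result = sorted(
    ({"_id": tag, "count": n} for tag, n in counts.items()),
    key=lambda d: (d["count"], d["_id"]),
    reverse=True,
)
print(result)
```

The printed list has the same tags and counts, in the same order, as the aggregate() output above.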
To run an explain plan for this aggregation use the command() method:
- >>> db.command('aggregate', 'things', pipeline=pipeline, explain=True)
- {u'ok': 1.0, u'stages': [...]}
As well as simple aggregations, the aggregation framework provides projection capabilities to reshape the returned data. Using projections and aggregation, you can add computed fields, create new virtual sub-objects, and extract sub-fields into the top level of results.
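As an illustration of a computed field, the sketch below defines a $project stage that adds a hypothetical tag_count field holding the length of each tags array. The field name is invented for this example, and it assumes a server recent enough to support the $size operator. The plain-Python loop only mimics the output shape so that the snippet runs without a server.

```python
# A $project stage that keeps x, drops _id, and adds a computed field.
# "tag_count" is a hypothetical field name chosen for this example.
project_pipeline = [
    {"$project": {"_id": 0, "x": 1, "tag_count": {"$size": "$tags"}}},
]

# With a live connection this would be run as:
#     list(db.things.aggregate(project_pipeline))

# Local approximation of the reshaped documents, for illustration only.
docs = [
    {"x": 1, "tags": ["dog", "cat"]},
    {"x": 2, "tags": ["cat"]},
    {"x": 2, "tags": ["mouse", "cat", "dog"]},
    {"x": 3, "tags": []},
]
reshaped = [{"x": d["x"], "tag_count": len(d["tags"])} for d in docs]
print(reshaped)
```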
See also
The full documentation for MongoDB’s aggregation framework
Map/Reduce
Another option for aggregation is to use the map reduce framework. Here we will define map and reduce functions to also count the number of occurrences for each tag in the tags array, across the entire collection.
Our map function just emits a single (key, 1) pair for each tag in the array:
- >>> from bson.code import Code
- >>> mapper = Code("""
- ... function () {
- ...   this.tags.forEach(function(z) {
- ...     emit(z, 1);
- ...   });
- ... }
- ... """)
The reduce function sums over all of the emitted values for a given key:
- >>> reducer = Code("""
- ... function (key, values) {
- ...   var total = 0;
- ...   for (var i = 0; i < values.length; i++) {
- ...     total += values[i];
- ...   }
- ...   return total;
- ... }
- ... """)
Note
We can’t just return values.length as the reduce function might be called iteratively on the results of other reduce steps.
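The effect described in this note can be reproduced in plain Python. The sketch below is not how MongoDB schedules reduce calls; it simply assumes that three emitted values for one key are reduced in two steps, and compares a summing reducer with one that returns the length. The function names are made up for this illustration.

```python
def reduce_sum(key, values):
    """Python analogue of the reducer above: add up the emitted values."""
    return sum(values)


def reduce_len(key, values):
    """Tempting but wrong: counts the values instead of summing them."""
    return len(values)


# Three emits for "cat", reduced in two steps: the first two values are
# reduced, then that partial result is reduced together with the third.
emitted = [1, 1, 1]

partial = reduce_sum("cat", emitted[:2])
total_sum = reduce_sum("cat", [partial, emitted[2]])

partial = reduce_len("cat", emitted[:2])
total_len = reduce_len("cat", [partial, emitted[2]])

print(total_sum)  # 3, the correct count
print(total_len)  # 2, because the partial result was counted as one value
```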
Finally, we call map_reduce() and iterate over the result collection:
- >>> result = db.things.map_reduce(mapper, reducer, "myresults")
- >>> for doc in result.find():
- ...     pprint.pprint(doc)
- ...
- {u'_id': u'cat', u'value': 3.0}
- {u'_id': u'dog', u'value': 2.0}
- {u'_id': u'mouse', u'value': 1.0}
Advanced Map/Reduce
PyMongo’s API supports all of the features of MongoDB’s map/reduce engine. One interesting feature is the ability to get more detailed results when desired, by passing full_response=True to map_reduce(). This returns the full response to the map/reduce command, rather than just the result collection:
- >>> pprint.pprint(
- ...     db.things.map_reduce(mapper, reducer, "myresults", full_response=True))
- {...u'counts': {u'emit': 6, u'input': 4, u'output': 3, u'reduce': 2},
- u'ok': ...,
- u'result': u'...',
- u'timeMillis': ...}
All of the optional map/reduce parameters are also supported; simply pass them as keyword arguments. In this example we use the query parameter to limit the documents that will be mapped over:
- >>> results = db.things.map_reduce(
- ...     mapper, reducer, "myresults", query={"x": {"$lt": 2}})
- >>> for doc in results.find():
- ...     pprint.pprint(doc)
- ...
- {u'_id': u'cat', u'value': 1.0}
- {u'_id': u'dog', u'value': 1.0}
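For readers who want to trace this result by hand, the following pure-Python sketch imitates the flow: filter the input as the query does, emit (tag, 1) pairs, group them by key, and reduce each group. It is only an approximation that runs without a server, and the helper names are hypothetical. As far as we know MongoDB may skip the reduce call for keys with a single value, which gives the same answer here because summing one value returns that value.

```python
from collections import defaultdict

# The same documents inserted in the Setup section (without _id fields).
docs = [
    {"x": 1, "tags": ["dog", "cat"]},
    {"x": 2, "tags": ["cat"]},
    {"x": 2, "tags": ["mouse", "cat", "dog"]},
    {"x": 3, "tags": []},
]


def map_tags(doc):
    """Analogue of the mapper: yield one (tag, 1) pair per tag."""
    for tag in doc["tags"]:
        yield tag, 1


def reduce_sum(key, values):
    """Analogue of the reducer: sum the values emitted for a key."""
    return sum(values)


# query={"x": {"$lt": 2}} restricts the input to documents with x < 2.
selected = [d for d in docs if d["x"] < 2]

# Group the emitted values by key, then reduce each group.
grouped = defaultdict(list)
for doc in selected:
    for key, value in map_tags(doc):
        grouped[key].append(value)

results = {key: reduce_sum(key, values) for key, values in grouped.items()}
print(sorted(results.items()))
```

Only the first document has x below 2, so each of its two tags ends up with a count of 1.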
You can use SON or collections.OrderedDict to specify a different database to store the result collection:
- >>> from bson.son import SON
- >>> pprint.pprint(
- ...     db.things.map_reduce(
- ...         mapper,
- ...         reducer,
- ...         out=SON([("replace", "results"), ("db", "outdb")]),
- ...         full_response=True))
- {...u'counts': {u'emit': 6, u'input': 4, u'output': 3, u'reduce': 2},
- u'ok': ...,
- u'result': {u'collection': ..., u'db': ...},
- u'timeMillis': ...}
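The same out specification can be sketched with collections.OrderedDict from the standard library, which the text above names as an alternative to SON. The snippet only builds the document and checks its key order; the commented call assumes a live connection and the mapper and reducer defined earlier.

```python
from collections import OrderedDict

# Key order is preserved: the output action first, then the target database.
out_spec = OrderedDict([("replace", "results"), ("db", "outdb")])
print(list(out_spec.keys()))

# With a live connection this could be passed as:
#     db.things.map_reduce(mapper, reducer, out=out_spec, full_response=True)
```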
See also
The full list of options for MongoDB’s map reduce engine