Map-Reduce and Sharded Collections
Map-reduce supports operations on sharded collections, both as an inputand as an output. This section describes the behaviors ofmapReduce
specific to sharded collections.
However, starting in version 4.2, MongoDB deprecates the map-reduceoption to create a new sharded collection as well as the use of thesharded
option for map-reduce. To output to a sharded collection,create the sharded collection first. MongoDB 4.2 also deprecates thereplacement of an existing sharded collection.
Sharded Collection as Input
When using sharded collection as the input for a map-reduce operation,mongos
will automatically dispatch the map-reduce job toeach shard in parallel. There is no special optionrequired. mongos
will wait for jobs on all shards tofinish.
Sharded Collection as Output
If the out
field for mapReduce
has the sharded
value, MongoDB shards the output collection using the _id
field asthe shard key.
Note
Starting in version 4.2, MongoDB deprecates the use of thesharded
option formapReduce
/db.collection.mapReduce
.
To output to a sharded collection:
- If the output collection does not exist, create the shardedcollection first.
Starting in version 4.2, MongoDB deprecates the map-reduce option tocreate a new sharded collection and the use of the sharded
option for map-reduce. As such, to output to a sharded collection,create the sharded collection first.
If you did not create the sharded collection first, MongoDB createsand shards the collection on the _id
field. However, it isrecommended that you create the sharded collection first.
Starting in version 4.2, MongoDB deprecates the replacement of anexisting sharded collection.
Starting in version 4.0, if the output collection already exists butis not sharded, map-reduce fails.
For a new or an empty sharded collection, MongoDB uses the results ofthe first stage of the map-reduce operation to create the initialchunks distributed among the shards.
mongos
dispatches, in parallel, a map-reducepost-processing job to every shard that owns a chunk. During thepost-processing, each shard will pull the resultsfor its own chunks from the other shards, run the finalreduce/finalize, and write locally to the output collection.
Note
- During later map-reduce jobs, MongoDB splits chunks as needed.
- Balancing of chunks for the output collection is automaticallyprevented during post-processing to avoid concurrency issues.