Aggregation with the Zip Code Data Set
The examples in this document use the zipcodes
collection. Thiscollection is available at: media.mongodb.org/zips.json. Use mongoimport
toload this data set into your mongod
instance.
Data Model
Each document in the zipcodes
collection has the following form:
- {
- "_id": "10280",
- "city": "NEW YORK",
- "state": "NY",
- "pop": 5574,
- "loc": [
- -74.016323,
- 40.710537
- ]
- }
- The
_id
field holds the zip code as a string. - The
city
field holds the city name. A city can have more than onezip code associated with it as different sections of the city caneach have a different zip code. - The
state
field holds the two letter state abbreviation. - The
pop
field holds the population. - The
loc
field holds the location as a latitude longitude pair.
aggregate() Method
All of the following examples use the aggregate()
helper in the mongo
shell.
The aggregate()
method uses theaggregation pipeline to processesdocuments into aggregated results. An aggregation pipeline consists of stages with each stage processingthe documents as they pass along the pipeline. Documents pass throughthe stages in sequence.
The aggregate()
method in themongo
shell provides a wrapper around theaggregate
database command. See the documentation for yourdriver for a more idiomatic interfacefor data aggregation operations.
Return States with Populations above 10 Million
The following aggregation operation returns all states with totalpopulation greater than 10 million:
- db.zipcodes.aggregate( [
- { $group: { _id: "$state", totalPop: { $sum: "$pop" } } },
- { $match: { totalPop: { $gte: 10*1000*1000 } } }
- ] )
In this example, the aggregation pipelineconsists of the $group
stage followed by the$match
stage:
- The
$group
stage groups the documents of thezipcode
collection by thestate
field, calculates thetotalPop
fieldfor each state, and outputs a document for each unique state.
The new per-state documents have two fields: the _id
field andthe totalPop
field. The _id
field contains the value of thestate
; i.e. the group by field. The totalPop
field is acalculated field that contains the total population of each state. Tocalculate the value, $group
uses the $sum
operator to add the population field (pop
) for each state.
After the $group
stage, the documents in thepipeline resemble the following:
- {
- "_id" : "AK",
- "totalPop" : 550043
- }
- The
$match
stage filters these grouped documents tooutput only those documents whosetotalPop
value is greater thanor equal to 10 million. The$match
stage does not alterthe matching documents but outputs the matching documents unmodified.
The equivalent SQL for this aggregation operation is:
- SELECT state, SUM(pop) AS totalPop
- FROM zipcodes
- GROUP BY state
- HAVING totalPop >= (10*1000*1000)
See also
Return Average City Population by State
The following aggregation operation returns the average populations forcities in each state:
- db.zipcodes.aggregate( [
- { $group: { _id: { state: "$state", city: "$city" }, pop: { $sum: "$pop" } } },
- { $group: { _id: "$_id.state", avgCityPop: { $avg: "$pop" } } }
- ] )
In this example, the aggregation pipelineconsists of the $group
stage followed by another$group
stage:
- The first
$group
stage groups the documents by thecombination ofcity
andstate
, uses the$sum
expression to calculate the population for each combination, andoutputs a document for eachcity
andstate
combination.[1]
After this stage in the pipeline, the documents resemble thefollowing:
- {
- "_id" : {
- "state" : "CO",
- "city" : "EDGEWATER"
- },
- "pop" : 13154
- }
- A second
$group
stage groups the documents in thepipeline by the_id.state
field (i.e. thestate
field insidethe_id
document), uses the$avg
expression to calculatethe average city population (avgCityPop
) for each state, andoutputs a document for each state.
The documents that result from this aggregation operation resembles thefollowing:
- {
- "_id" : "MN",
- "avgCityPop" : 5335
- }
See also
Return Largest and Smallest Cities by State
The following aggregation operation returns the smallest and largestcities by population for each state:
- db.zipcodes.aggregate( [
- { $group:
- {
- _id: { state: "$state", city: "$city" },
- pop: { $sum: "$pop" }
- }
- },
- { $sort: { pop: 1 } },
- { $group:
- {
- _id : "$_id.state",
- biggestCity: { $last: "$_id.city" },
- biggestPop: { $last: "$pop" },
- smallestCity: { $first: "$_id.city" },
- smallestPop: { $first: "$pop" }
- }
- },
- // the following $project is optional, and
- // modifies the output format.
- { $project:
- { _id: 0,
- state: "$_id",
- biggestCity: { name: "$biggestCity", pop: "$biggestPop" },
- smallestCity: { name: "$smallestCity", pop: "$smallestPop" }
- }
- }
- ] )
In this example, the aggregation pipelineconsists of a $group
stage, a $sort
stage,another $group
stage, and a $project
stage:
- The first
$group
stage groups the documents by thecombination of thecity
andstate
, calculates thesum
of thepop
values for each combination, and outputs adocument for eachcity
andstate
combination.
At this stage in the pipeline, the documents resemble the following:
- {
- "_id" : {
- "state" : "CO",
- "city" : "EDGEWATER"
- },
- "pop" : 13154
- }
The
$sort
stage orders the documents in the pipeline bythepop
field value, from smallest to largest; i.e. byincreasing order. This operation does not alter the documents.The next
$group
stage groups the now-sorted documentsby the_id.state
field (i.e. thestate
field inside the_id
document) and outputs a document for each state.
The stage also calculates the following four fields for each state.Using the $last
expression, the $group
operatorcreates the biggestCity
and biggestPop
fields that store thecity with the largest population and that population. Using the$first
expression, the $group
operator createsthe smallestCity
and smallestPop
fields that store the citywith the smallest population and that population.
The documents, at this stage in the pipeline, resemble the following:
- {
- "_id" : "WA",
- "biggestCity" : "SEATTLE",
- "biggestPop" : 520096,
- "smallestCity" : "BENGE",
- "smallestPop" : 2
- }
- The final
$project
stage renames the_id
field tostate
and moves thebiggestCity
,biggestPop
,smallestCity
, andsmallestPop
intobiggestCity
andsmallestCity
embedded documents.
The output documents of this aggregation operation resemble the following:
- {
- "state" : "RI",
- "biggestCity" : {
- "name" : "CRANSTON",
- "pop" : 176404
- },
- "smallestCity" : {
- "name" : "CLAYVILLE",
- "pop" : 45
- }
- }
[1] | A city can have more than one zipcode associated with it as different sections of the city can eachhave a different zip code. |