6.1.6. The GROUP BY
clause
GROUP BY
merges output rows that have the same combination of values in its item list into a single row. Aggregate functions in the select list are applied to each group individually instead of to the dataset as a whole.
If the select list only contains aggregate columns or, more generally, columns whose values don’t depend on individual rows in the underlying set, GROUP BY
is optional. When omitted, the final result set of will consist of a single row (provided that at least one aggregated column is present).
If the select list contains both aggregate columns and columns whose values may vary per row, the GROUP BY
clause becomes mandatory.
Syntax
SELECT ... FROM ...
GROUP BY <grouping-item> [, <grouping-item> ...]
[HAVING <grouped-row-condition>]
...
<grouping-item> ::=
<non-aggr-select-item>
| <non-aggr-expression>
<non-aggr-select-item> ::=
column-copy
| column-alias
| column-position
Argument | Description |
---|---|
non-aggr-expression | Any non-aggregating expression that is not included in the |
column-copy | A literal copy, from the |
column-alias | The alias, from the |
column-position | The position number, in the |
A general rule of thumb is that every non-aggregate item in the SELECT
list must also be in the GROUP BY
list. You can do this in three ways:
By copying the item verbatim from the select list, e.g. “
class
” or “'D:' || upper(doccode)
”.By specifying the column alias, if it exists.
By specifying the column position as an integer literal between 1 and the number of columns. Integer values resulting from expressions or parameter substitutions are simply invariables and will be used as such in the grouping. They will have no effect though, as their value is the same for each row.
If you group by a column position, the expression at that position is copied internally from the select list. If it concerns a subquery, that subquery will be executed again in the grouping phase. That is to say, grouping by the column position, rather than duplicating the subquery expression in the grouping clause, saves keystrokes and bytes, but it is not a way of saving processing cycles! |
In addition to the required items, the grouping list may also contain:
Columns from the source table that are not in the select list, or non-aggregate expressions based on such columns. Adding such columns may further subdivide the groups. But since these columns are not in the select list, you can’t tell which aggregated row corresponds to which value in the column. So, in general, if you are interested in this information, you also include the column or expression in the select list — which brings you back to the rule: “every non-aggregate column in the select list must also be in the grouping list”.
Expressions that aren’t dependent on the data in the underlying set, e.g. constants, context variables, single-value non-correlated subselects etc. This is only mentioned for completeness, as adding such items is utterly pointless: they don’t affect the grouping at all. “Harmless but useless” items like these may also figure in the select list without being copied to the grouping list.
Examples
When the select list contains only aggregate columns, GROUP BY
is not mandatory:
select count(*), avg(age) from students
where sex = 'M';
This will return a single row listing the number of male students and their average age. Adding expressions that don’t depend on values in individual rows of table STUDENTS
doesn’t change that:
select count(*), avg(age), current_date from students
where sex = 'M';
The row will now have an extra column showing the current date, but other than that, nothing fundamental has changed. A GROUP BY
clause is still not required.
However, in both the above examples it is allowed. This is perfectly valid:
select count(*), avg(age) from students
where sex = 'M'
group by class;
and will return a row for each class that has boys in it, listing the number of boys and their average age in that particular class. (If you also leave the current_date
field in, this value will be repeated on every row, which is not very exciting.)
The above query has a major drawback though: it gives you information about the different classes, but it doesn’t tell you which row applies to which class. In order to get that extra bit of information, the non-aggregate column CLASS
must be added to the select list:
select class, count(*), avg(age) from students
where sex = 'M'
group by class;
Now we have a useful query. Notice that the addition of column CLASS
also makes the GROUP BY
clause mandatory. We can’t drop that clause anymore, unless we also remove CLASS
from the column list.
The output of our last query may look something like this:
CLASS | COUNT | AVG |
---|---|---|
2A | 12 | 13.5 |
2B | 9 | 13.9 |
3A | 11 | 14.6 |
3B | 12 | 14.4 |
… | … | … |
The headings “COUNT” and “AVG” are not very informative. In a simple case like this, you might get away with that, but in general you should give aggregate columns a meaningful name by aliasing them:
select class,
count(*) as num_boys,
avg(age) as boys_avg_age
from students
where sex = 'M'
group by class;
As you may recall from the formal syntax of the columns list, the AS
keyword is optional.
Adding more non-aggregate (or rather: row-dependent) columns requires adding them to the GROUP BY
clause too. For instance, you might want to see the above information for girls as well; and you may also want to differentiate between boarding and day students:
select class,
sex,
boarding_type,
count(*) as number,
avg(age) as avg_age
from students
group by class, sex, boarding_type;
This may give you the following result:
CLASS | SEX | BOARDING_TYPE | NUMBER | AVG_AGE |
---|---|---|---|---|
2A | F | BOARDING | 9 | 13.3 |
2A | F | DAY | 6 | 13.5 |
2A | M | BOARDING | 7 | 13.6 |
2A | M | DAY | 5 | 13.4 |
2B | F | BOARDING | 11 | 13.7 |
2B | F | DAY | 5 | 13.7 |
2B | M | BOARDING | 6 | 13.8 |
… | … | … | … | … |
Each row in the result set corresponds to one particular combination of the variables class, sex and boarding type. The aggregate results — number and average age — are given for each of these rather specific groups individually. In a query like this, you don’t see a total for boys as a whole, or day students as a whole. That’s the tradeoff: the more non-aggregate columns you add, the more you can pinpoint very specific groups, but the more you also lose sight of the general picture. Of course, you can still obtain the “coarser” aggregates through separate queries.
HAVING
Just as a WHERE
clause limits the rows in a dataset to those that meet the search condition, so the HAVING
subclause imposes restrictions on the aggregated rows in a grouped set. HAVING
is optional, and can only be used in conjunction with GROUP BY
.
The condition(s) in the HAVING
clause can refer to:
Any aggregated column in the select list. This is the most widely used alternative.
Any aggregated expression that is not in the select list, but allowed in the context of the query. This is sometimes useful too.
Any column in the
GROUP BY
list. While legal, it is more efficient to filter on these non-aggregated data at an earlier stage: in theWHERE
clause.Any expression whose value doesn’t depend on the contents of the dataset (like a constant or a context variable). This is valid but utterly pointless, because it will either suppress the entire set or leave it untouched, based on conditions that have nothing to do with the set itself.
A HAVING
clause can not contain:
Non-aggregated column expressions that are not in the
GROUP BY
list.Column positions. An integer in the
HAVING
clause is just an integer.Column aliases –- not even if they appear in the
GROUP BY
clause!
Examples
Building on our earlier examples, this could be used to skip small groups of students:
select class,
count(*) as num_boys,
avg(age) as boys_avg_age
from students
where sex = 'M'
group by class
having count(*) >= 5;
To select only groups that have a minimum age spread:
select class,
count(*) as num_boys,
avg(age) as boys_avg_age
from students
where sex = 'M'
group by class
having max(age) - min(age) > 1.2;
Notice that if you’re really interested in this information, you’d normally include min(age)
and max(age)
-– or the expression “max(age) - min(age)
” –- in the select list as well!
To include only 3rd classes:
select class,
count(*) as num_boys,
avg(age) as boys_avg_age
from students
where sex = 'M'
group by class
having class starting with '3';
Better would be to move this condition to the WHERE
clause:
select class,
count(*) as num_boys,
avg(age) as boys_avg_age
from students
where sex = 'M' and class starting with '3'
group by class;