GROUP BY Clause
Specify the GROUP BY
clause in queries that use aggregation functions, such as [COUNT()]($38203069b9301e6b.md#count)
, [SUM()]($79326ebbfad28a19.md#sum)
, [AVG()]($9e608d6dbc85e4fe.md#avg)
, [MIN()]($72a839014b6964af.md#min)
, and [MAX()]($5188c27db7c618b9.md#max)
. Specify in the [GROUP BY]($a7355b6eff3af334.md#group_by)
clause the names of all the columns that do not participate in the aggregation operation.
Complex type considerations:
In Impala 2.3 and higher, the complex data types STRUCT
, ARRAY
, and MAP
are available. These columns cannot be referenced directly in the ORDER BY
clause. When you query a complex type column, you use join notation to “unpack” the elements of the complex type, and within the join query you can include an ORDER BY
clause to control the order in the result set of the scalar elements from the complex type. See Complex Types (Impala 2.3 or higher only) for details about Impala support for complex types.
Zero-length strings: For purposes of clauses such as DISTINCT
and GROUP BY
, Impala considers zero-length strings (""
), NULL
, and space to all be different values.
Examples:
For example, the following query finds the 5 items that sold the highest total quantity (using the SUM()
function, and also counts the number of sales transactions for those items (using the COUNT()
function). Because the column representing the item IDs is not used in any aggregation functions, we specify that column in the GROUP BY
clause.
select
ss_item_sk as Item,
count(ss_item_sk) as Times_Purchased,
sum(ss_quantity) as Total_Quantity_Purchased
from store_sales
group by ss_item_sk
order by sum(ss_quantity) desc
limit 5;
+-------+-----------------+--------------------------+
| item | times_purchased | total_quantity_purchased |
+-------+-----------------+--------------------------+
| 9325 | 372 | 19072 |
| 4279 | 357 | 18501 |
| 7507 | 371 | 18475 |
| 5953 | 369 | 18451 |
| 16753 | 375 | 18446 |
+-------+-----------------+--------------------------+
The HAVING
clause lets you filter the results of aggregate functions, because you cannot refer to those expressions in the WHERE
clause. For example, to find the 5 lowest-selling items that were included in at least 100 sales transactions, we could use this query:
select
ss_item_sk as Item,
count(ss_item_sk) as Times_Purchased,
sum(ss_quantity) as Total_Quantity_Purchased
from store_sales
group by ss_item_sk
having times_purchased >= 100
order by sum(ss_quantity)
limit 5;
+-------+-----------------+--------------------------+
| item | times_purchased | total_quantity_purchased |
+-------+-----------------+--------------------------+
| 13943 | 105 | 4087 |
| 2992 | 101 | 4176 |
| 4773 | 107 | 4204 |
| 14350 | 103 | 4260 |
| 11956 | 102 | 4275 |
+-------+-----------------+--------------------------+
When performing calculations involving scientific or financial data, remember that columns with type FLOAT
or DOUBLE
are stored as true floating-point numbers, which cannot precisely represent every possible fractional value. Thus, if you include a FLOAT
or DOUBLE
column in a GROUP BY
clause, the results might not precisely match literal values in your query or from an original Text data file. Use rounding operations, the BETWEEN
operator, or another arithmetic technique to match floating-point values that are “near” literal values you expect. For example, this query on the ss_wholesale_cost
column returns cost values that are close but not identical to the original figures that were entered as decimal fractions.
select ss_wholesale_cost, avg(ss_quantity * ss_sales_price) as avg_revenue_per_sale
from sales
group by ss_wholesale_cost
order by avg_revenue_per_sale desc
limit 5;
+-------------------+----------------------+
| ss_wholesale_cost | avg_revenue_per_sale |
+-------------------+----------------------+
| 96.94000244140625 | 4454.351539300434 |
| 95.93000030517578 | 4423.119941283189 |
| 98.37999725341797 | 4332.516490316291 |
| 97.97000122070312 | 4330.480601655014 |
| 98.52999877929688 | 4291.316953108634 |
+-------------------+----------------------+
Notice how wholesale cost values originally entered as decimal fractions such as 96.94
and 98.38
are slightly larger or smaller in the result set, due to precision limitations in the hardware floating-point types. The imprecise representation of FLOAT
and DOUBLE
values is why financial data processing systems often store currency using data types that are less space-efficient but avoid these types of rounding errors.
Related information:
SELECT Statement, Impala Aggregate Functions
Parent topic: SELECT Statement