How do I: remove duplicates?
PRQL doesn’t have a specific distinct
keyword. Instead duplicate tuples in a relation can be removed by using group
and take 1
:
PRQL
from employees
select department
group employees.* (
take 1
)
SQL
SELECT
DISTINCT department
FROM
employees
This also works with a wildcard:
PRQL
from employees
group employees.* (take 1)
SQL
SELECT
DISTINCT *
FROM
employees
Remove duplicates from each group?
To select a single row from each group group
can be combined with sort
and take
:
PRQL
# youngest employee from each department
from employees
group department (
sort age
take 1
)
SQL
WITH table_0 AS (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY department
ORDER BY
age
) AS _expr_0
FROM
employees
)
SELECT
*
FROM
table_0
WHERE
_expr_0 <= 1
Note that we can’t always compile to DISTINCT
; when the columns in the group
aren’t all the available columns, we need to use a window function:
PRQL
from employees
group {first_name, last_name} (take 1)
SQL
WITH table_0 AS (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY first_name, last_name) AS _expr_0
FROM
employees
)
SELECT
*
FROM
table_0
WHERE
_expr_0 <= 1
When compiling to Postgres or DuckDB dialect, such queries will be compiled to DISTINCT ON
, which is the most performant option.
PRQL
prql target:sql.postgres
from employees
group department (
sort age
take 1
)
SQL
SELECT
DISTINCT ON (department) *
FROM
employees
ORDER BY
department,
age