Relations
PRQL is designed on top of relational algebra, which is the established data model used by modern SQL databases. A relation has a rigid mathematical definition, which can be simplified to “a table of data”. For example, the invoices
table from the Chinook database (https://github.com/lerocha/chinook-database) looks like this:
invoice_id | customer_id | billing_city | other columns | total |
---|---|---|---|---|
1 | 2 | Stuttgart | … | 1.98 |
2 | 4 | Oslo | … | 3.96 |
3 | 8 | Brussels | … | 5.94 |
4 | 14 | Edmonton | … | 8.91 |
5 | 23 | Boston | … | 13.86 |
6 | 37 | Frankfurt | … | 0.99 |
A relation is composed of rows. Each row in a relation contains a value for each of the relation’s columns. Each column in a relation has an unique name and a designated data type. The table above is a relation, and has columns named invoice_id
and customer_id
each with a data type of “integer number”, a billing_city
column with a data type of “text”, several other columns, and a total
column that contains floating-point numbers.
Queries
The main purpose of PRQL is to build queries that combine and transform data from relations such as the invoices
table above. Here is the most basic query:
from invoices
Note
Try each of these examples here in the Playground. Enter the query on the left-hand side, and click output.arrow in the right-hand side to see the result.
The result of the query above is not terribly interesting, it’s just the same relation as before.
select transform
The select
function picks the columns to pass through based on a list and discards all others. Formally, that list is a tuple of comma-separated expressions wrapped in { ... }
.
Suppose we only need the order_id
and total
columns. Use select
to choose the columns to pass through. (Try it in the Playground.)
from invoices
select { order_id, total }
We can write the items in the tuple on one or several lines: trailing commas are ignored. In addition, we can assign any of the expressions to a variable that becomes the name of the resulting column in the SQL output.
from invoices
select {
OrderID = invoice_id,
Total = total,
}
This is the same query as above, rewritten on multiple lines, and assigning OrderID
and Total
names to the columns.
Once we select
certain columns, subsequent transforms will have access only to those columns named in the tuple.
derive transform
To add columns to a relation, we can use derive
function. Let’s define a new column for Value Added Tax, set at 19% of the invoice total.
from invoices
derive { VAT = total * 0.19 }
The value of the new column can be a constant (such as a number or a string), or can be computed from the value of an existing column. Note that the value of the new column is assigned the name VAT
.
join transform
The join
transform also adds columns to the relation by combining the rows from two relations “side by side”. To determine which rows from each relation should be joined, join
has match criteria, written in ( ... )
.
from invoices
join customers ( ==customer_id )
This example “connects” the customer information from the customers
relation with the information from the invoices
relation, using identical values of the customer_id
column from each relation to match the rows.
It is frequently useful to assign an alias to both relations being joined together so that each relation’s columns can be referred to uniquely.
from inv=invoices
join cust=customers ( ==customer_id )
In the example above, the alias inv
represents the invoices
relation and cust
represents the customers
relation. It then becomes possible to refer to inv.billing_city
and cust.last_name
unambiguously.
Summary
PRQL manipulates relations (tables) of data. The derive
, select
, and join
transforms change the number of columns in a table. The first two never affect the number of rows in a table. join
may change the number of rows, depending on the variation chosen.
This final example combines the above into a single query. It illustrates a pipeline - the fundamental basis of PRQL. We simply add new lines (transforms) at the end of the query. Each transform modifies the relation produced by the statement above to produce the desired result.
from inv=invoices
join cust=customers (==customer_id)
derive { VAT = inv.total * 0.19 }
select {
OrderID = inv.invoice_id,
CustomerName = cust.last_name,
Total = inv.total,
VAT,
}