Faceted search using PostgreSQL full text search
Faceted search or faceted navigation allows users to narrow down search results by applying multiple filters generated from some attributes or tags. In this article we will implement faceted search using PostgreSQL full text searchopen in new window and ts_stat
function.
GitHub search is a good example of faceted navigation (see the image on the right).
Creating a table
Let’s start by creating books
table with a name, tags (attributes), and a text search vector:
CREATE TABLE books (
id bigint PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
name varchar(1000),
tags jsonb,
tsv tsvector
);
tsvectoropen in new window is a sorted list of distinct normalized words (lexemes) that are used for searching. You can create a tsvector using to_tsvector
function:
SELECT to_tsvector('english', 'The Fat Rats');
to_tsvector
-----------------
'fat':2 'rat':3
You can use to_tsvector
when inserting rows to the table:
INSERT INTO books (name, tsv)
VALUES
('hello word', to_tsvector('english', 'hello world')),
('foo bar', to_tsvector('english', 'foo bar'))
RETURNING *;
id | name | attrs | tsv
----+------------+-------+---------------------
1 | hello word | | 'hello':1 'world':2
2 | foo bar | | 'bar':2 'foo':1
Once you have some data, you can search over books using a tsvector
and a tsquery
:
SELECT * FROM books
WHERE tsv @@ websearch_to_tsquery('english', 'hello');
id | name | tags | tsv
----+------------+------+---------------------
1 | hello word | | 'hello':1 'world':2
That query can be slow if your dataset is large, but you can make it faster by adding an inverted index on tsv
column:
CREATE INDEX books_tsv_idx ON books USING GIN (tsv);
And check that PostgreSQL uses the index:
EXPLAIN ANALYZE
SELECT * FROM books
WHERE tsv @@ websearch_to_tsquery('english', 'hello');
Creating facets from tags
We will be using the following dataset to test our queries:
- model: Book
rows:
- name: The Gods Themselves by Isaac Asimov
tags:
- moods:adventurous
- moods:challenging
- pace:medium
- name: Legend by David Gemmell
tags:
- moods:adventurous
- moods:emotional
- pace:fast
- name: Lord of Light by Roger Zelazny
tags:
- moods:adventurous
- moods:challenging
- pace:medium
- name: The Name of the Wind by Patrick Rothfuss
tags:
- moods:adventurous
- moods:mysterious
- pace:medium
- name: Hyperion by Dan Simmons
tags:
- moods:mysterious
- moods:adventurous
- pace:medium
You can insert those books using the following query:
INSERT INTO "books" ("name", "tags", "tsv")
VALUES
('The Gods Themselves by Isaac Asimov', '["moods:adventurous","moods:challenging","pace:medium"]', array_to_tsvector('{"moods:adventurous","moods:challenging","pace:medium"}')),
('Legend by David Gemmell', '["moods:adventurous","moods:emotional","pace:fast"]', array_to_tsvector('{"moods:adventurous","moods:emotional","pace:fast"}')),
('Lord of Light by Roger Zelazny', '["moods:adventurous","moods:challenging","pace:medium"]', array_to_tsvector('{"moods:adventurous","moods:challenging","pace:medium"}')),
('The Name of the Wind by Patrick Rothfuss', '["moods:adventurous","moods:mysterious","pace:medium"]', array_to_tsvector('{"moods:adventurous","moods:mysterious","pace:medium"}')),
('Hyperion by Dan Simmons', '["moods:mysterious","moods:adventurous","pace:medium"]', array_to_tsvector('{"moods:mysterious","moods:adventurous","pace:medium"}'));
And then filter books by tags:
SELECT name, tags FROM books WHERE tsv @@ 'moods\:mysterious'::tsquery;
name | tags
------------------------------------------+----------------------------------------------------------
The Name of the Wind by Patrick Rothfuss | ["moods:adventurous", "moods:mysterious", "pace:medium"]
Hyperion by Dan Simmons | ["moods:mysterious", "moods:adventurous", "pace:medium"]
Constructing a facet
Let’s start by defining a facet we are expecting to get in the end:
Attr | Value | Book count |
---|---|---|
moods | adventurous | 5 |
moods | challenging | 2 |
moods | emotional | 1 |
moods | mysterious | 2 |
pace | fast | 1 |
pace | medium | 4 |
We could easily achieve that result with the following query:
WITH tags AS (
SELECT jsonb_array_elements_text(tags) AS tag
FROM books
)
SELECT
split_part(tag, ':', 1) AS attr,
split_part(tag, ':', 2) AS value,
count(*) AS count
FROM tags
GROUP by attr, value
ORDER BY attr, value, count DESC;
attr | value | count
-------+-------------+-------
moods | adventurous | 5
moods | challenging | 2
moods | emotional | 1
moods | mysterious | 2
pace | fast | 1
pace | medium | 4
But it is rather slow and inefficient because we need to select all tags
to build the facet. Can we do better? Yes, using ts_stat
function to get the required data directly from the tsv
column.
Retrieving document stats
The function ts_stat
allows to retrieve document statistics that are maitained by PostgreSQL full text search engine in tsvector
columns.
SELECT word, ndoc FROM ts_stat($$ SELECT tsv FROM books $$) ORDER BY word;
word | ndoc
-------------------+------
moods:adventurous | 5
moods:challenging | 2
moods:emotional | 1
moods:mysterious | 2
pace:fast | 1
pace:medium | 4
As you can see, PostgreSQL already maintains the stats we need to build the facet only using the tsv
column:
SELECT
split_part(word, ':', 1) AS attr,
split_part(word, ':', 2) AS value,
ndoc AS count
FROM ts_stat($$ SELECT tsv FROM books $$)
ORDER BY word;
attr | value | count
-------+-------------+-------
moods | adventurous | 5
moods | challenging | 2
moods | emotional | 1
moods | mysterious | 2
pace | fast | 1
pace | medium | 4
To build a refined facet, you can use a fast filter over the same tsv
column that is covered by the index we created earlier:
SELECT
split_part(word, ':', 1) AS attr,
split_part(word, ':', 2) AS value,
ndoc AS count
FROM ts_stat($$
SELECT tsv FROM books
WHERE tsv @@ 'pace\:fast'::tsquery
$$)
ORDER BY word;
attr | value | count
-------+-------------+-------
moods | adventurous | 1
moods | emotional | 1
pace | fast | 1
Conclusion
PostgreSQL provides everything you need to build fast faceted search for datasets up to 1 million rows. With larger datasets the processing time becomes an issue and you may need to shard your database.
You can also check pg-faceted-searchopen in new window example that demonstrates how to implement faceted search using Go and Bun database client.