Faceted search using PostgreSQL full text search

Faceted search using PostgreSQL full text search

Faceted search or faceted navigation allows users to narrow down search results by applying multiple filters generated from some attributes or tags. In this article we will implement faceted search using PostgreSQL full text searchopen in new window and ts_stat function.

GitHub search is a good example of faceted navigation (see the image on the right).

Creating a table

Let’s start by creating books table with a name, tags (attributes), and a text search vector:

CREATE TABLE books (
  id bigint PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
  name varchar(1000),
  tags jsonb,
  tsv tsvector
);

tsvectoropen in new window is a sorted list of distinct normalized words (lexemes) that are used for searching. You can create a tsvector using to_tsvector function:

SELECT to_tsvector('english', 'The Fat Rats');
   to_tsvector
-----------------
 'fat':2 'rat':3

You can use to_tsvector when inserting rows to the table:

INSERT INTO books (name, tsv)
VALUES
  ('hello word', to_tsvector('english', 'hello world')),
  ('foo bar', to_tsvector('english', 'foo bar'))
RETURNING *;
 id |    name    | attrs |         tsv
----+------------+-------+---------------------
  1 | hello word |       | 'hello':1 'world':2
  2 | foo bar    |       | 'bar':2 'foo':1

Once you have some data, you can search over books using a tsvector and a tsquery:

SELECT * FROM books
WHERE tsv @@ websearch_to_tsquery('english', 'hello');
 id |    name    | tags |         tsv
----+------------+------+---------------------
  1 | hello word |      | 'hello':1 'world':2

That query can be slow if your dataset is large, but you can make it faster by adding an inverted index on tsv column:

CREATE INDEX books_tsv_idx ON books USING GIN (tsv);

And check that PostgreSQL uses the index:

EXPLAIN ANALYZE
SELECT * FROM books
WHERE tsv @@ websearch_to_tsquery('english', 'hello');

We will be using the following dataset to test our queries:

- model: Book
  rows:
    - name: The Gods Themselves by Isaac Asimov
      tags:
        - moods:adventurous
        - moods:challenging
        - pace:medium
    - name: Legend by David Gemmell
      tags:
        - moods:adventurous
        - moods:emotional
        - pace:fast
    - name: Lord of Light by Roger Zelazny
      tags:
        - moods:adventurous
        - moods:challenging
        - pace:medium
    - name: The Name of the Wind by Patrick Rothfuss
      tags:
        - moods:adventurous
        - moods:mysterious
        - pace:medium
    - name: Hyperion by Dan Simmons
      tags:
        - moods:mysterious
        - moods:adventurous
        - pace:medium

You can insert those books using the following query:

INSERT INTO "books" ("name", "tags", "tsv")
VALUES
  ('The Gods Themselves by Isaac Asimov', '["moods:adventurous","moods:challenging","pace:medium"]', array_to_tsvector('{"moods:adventurous","moods:challenging","pace:medium"}')),
  ('Legend by David Gemmell', '["moods:adventurous","moods:emotional","pace:fast"]', array_to_tsvector('{"moods:adventurous","moods:emotional","pace:fast"}')),
  ('Lord of Light by Roger Zelazny', '["moods:adventurous","moods:challenging","pace:medium"]', array_to_tsvector('{"moods:adventurous","moods:challenging","pace:medium"}')),
  ('The Name of the Wind by Patrick Rothfuss', '["moods:adventurous","moods:mysterious","pace:medium"]', array_to_tsvector('{"moods:adventurous","moods:mysterious","pace:medium"}')),
  ('Hyperion by Dan Simmons', '["moods:mysterious","moods:adventurous","pace:medium"]', array_to_tsvector('{"moods:mysterious","moods:adventurous","pace:medium"}'));

And then filter books by tags:

SELECT name, tags FROM books WHERE tsv @@ 'moods\:mysterious'::tsquery;
                   name                   |                           tags
------------------------------------------+----------------------------------------------------------
 The Name of the Wind by Patrick Rothfuss | ["moods:adventurous", "moods:mysterious", "pace:medium"]
 Hyperion by Dan Simmons                  | ["moods:mysterious", "moods:adventurous", "pace:medium"]

Let’s start by defining a facet we are expecting to get in the end:

Attr	Value	Book count
moods	adventurous	5
moods	challenging	2
moods	emotional	1
moods	mysterious	2
pace	fast	1
pace	medium	4

We could easily achieve that result with the following query:

WITH tags AS (
  SELECT jsonb_array_elements_text(tags) AS tag
  FROM books
)
SELECT
  split_part(tag, ':', 1) AS attr,
  split_part(tag, ':', 2) AS value,
  count(*) AS count
FROM tags
GROUP by attr, value
ORDER BY attr, value, count DESC;
 attr  |    value    | count
-------+-------------+-------
 moods | adventurous |     5
 moods | challenging |     2
 moods | emotional   |     1
 moods | mysterious  |     2
 pace  | fast        |     1
 pace  | medium      |     4

But it is rather slow and inefficient because we need to select all tags to build the facet. Can we do better? Yes, using ts_stat function to get the required data directly from the tsv column.

Retrieving document stats

The function ts_stat allows to retrieve document statistics that are maitained by PostgreSQL full text search engine in tsvector columns.

SELECT word, ndoc FROM ts_stat($$ SELECT tsv FROM books $$) ORDER BY word;
       word        | ndoc
-------------------+------
 moods:adventurous |    5
 moods:challenging |    2
 moods:emotional   |    1
 moods:mysterious  |    2
 pace:fast         |    1
 pace:medium       |    4

As you can see, PostgreSQL already maintains the stats we need to build the facet only using the tsv column:

SELECT
  split_part(word, ':', 1) AS attr,
  split_part(word, ':', 2) AS value,
  ndoc AS count
FROM ts_stat($$ SELECT tsv FROM books $$)
ORDER BY word;
 attr  |    value    | count
-------+-------------+-------
 moods | adventurous |     5
 moods | challenging |     2
 moods | emotional   |     1
 moods | mysterious  |     2
 pace  | fast        |     1
 pace  | medium      |     4

To build a refined facet, you can use a fast filter over the same tsv column that is covered by the index we created earlier:

SELECT
  split_part(word, ':', 1) AS attr,
  split_part(word, ':', 2) AS value,
  ndoc AS count
FROM ts_stat($$
  SELECT tsv FROM books
  WHERE tsv @@ 'pace\:fast'::tsquery
$$)
ORDER BY word;
 attr  |    value    | count
-------+-------------+-------
 moods | adventurous |     1
 moods | emotional   |     1
 pace  | fast        |     1

Conclusion

PostgreSQL provides everything you need to build fast faceted search for datasets up to 1 million rows. With larger datasets the processing time becomes an issue and you may need to shard your database.

You can also check pg-faceted-searchopen in new window example that demonstrates how to implement faceted search using Go and Bun database client.

Faceted navigation and search

Faceted search using PostgreSQL full text search

Creating a table

Creating facets from tags

Constructing a facet

Retrieving document stats

Conclusion