Faceted search using PostgreSQL full text search

Faceted navigation and search - 图1

Faceted search or faceted navigation allows users to narrow down search results by applying multiple filters generated from some attributes or tags. In this article we will implement faceted search using PostgreSQL full text searchFaceted navigation and search - 图2open in new window and ts_stat function.

GitHub search is a good example of faceted navigation (see the image on the right).


Creating a table

Let’s start by creating books table with a name, tags (attributes), and a text search vector:

  1. CREATE TABLE books (
  2. id bigint PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
  3. name varchar(1000),
  4. tags jsonb,
  5. tsv tsvector
  6. );

tsvectorFaceted navigation and search - 图3open in new window is a sorted list of distinct normalized words (lexemes) that are used for searching. You can create a tsvector using to_tsvector function:

  1. SELECT to_tsvector('english', 'The Fat Rats');
  2. to_tsvector
  3. -----------------
  4. 'fat':2 'rat':3

You can use to_tsvector when inserting rows to the table:

  1. INSERT INTO books (name, tsv)
  2. VALUES
  3. ('hello word', to_tsvector('english', 'hello world')),
  4. ('foo bar', to_tsvector('english', 'foo bar'))
  5. RETURNING *;
  6. id | name | attrs | tsv
  7. ----+------------+-------+---------------------
  8. 1 | hello word | | 'hello':1 'world':2
  9. 2 | foo bar | | 'bar':2 'foo':1

Once you have some data, you can search over books using a tsvector and a tsquery:

  1. SELECT * FROM books
  2. WHERE tsv @@ websearch_to_tsquery('english', 'hello');
  3. id | name | tags | tsv
  4. ----+------------+------+---------------------
  5. 1 | hello word | | 'hello':1 'world':2

That query can be slow if your dataset is large, but you can make it faster by adding an inverted index on tsv column:

  1. CREATE INDEX books_tsv_idx ON books USING GIN (tsv);

And check that PostgreSQL uses the index:

  1. EXPLAIN ANALYZE
  2. SELECT * FROM books
  3. WHERE tsv @@ websearch_to_tsquery('english', 'hello');

Creating facets from tags

We will be using the following dataset to test our queries:

  1. - model: Book
  2. rows:
  3. - name: The Gods Themselves by Isaac Asimov
  4. tags:
  5. - moods:adventurous
  6. - moods:challenging
  7. - pace:medium
  8. - name: Legend by David Gemmell
  9. tags:
  10. - moods:adventurous
  11. - moods:emotional
  12. - pace:fast
  13. - name: Lord of Light by Roger Zelazny
  14. tags:
  15. - moods:adventurous
  16. - moods:challenging
  17. - pace:medium
  18. - name: The Name of the Wind by Patrick Rothfuss
  19. tags:
  20. - moods:adventurous
  21. - moods:mysterious
  22. - pace:medium
  23. - name: Hyperion by Dan Simmons
  24. tags:
  25. - moods:mysterious
  26. - moods:adventurous
  27. - pace:medium

You can insert those books using the following query:

  1. INSERT INTO "books" ("name", "tags", "tsv")
  2. VALUES
  3. ('The Gods Themselves by Isaac Asimov', '["moods:adventurous","moods:challenging","pace:medium"]', array_to_tsvector('{"moods:adventurous","moods:challenging","pace:medium"}')),
  4. ('Legend by David Gemmell', '["moods:adventurous","moods:emotional","pace:fast"]', array_to_tsvector('{"moods:adventurous","moods:emotional","pace:fast"}')),
  5. ('Lord of Light by Roger Zelazny', '["moods:adventurous","moods:challenging","pace:medium"]', array_to_tsvector('{"moods:adventurous","moods:challenging","pace:medium"}')),
  6. ('The Name of the Wind by Patrick Rothfuss', '["moods:adventurous","moods:mysterious","pace:medium"]', array_to_tsvector('{"moods:adventurous","moods:mysterious","pace:medium"}')),
  7. ('Hyperion by Dan Simmons', '["moods:mysterious","moods:adventurous","pace:medium"]', array_to_tsvector('{"moods:mysterious","moods:adventurous","pace:medium"}'));

And then filter books by tags:

  1. SELECT name, tags FROM books WHERE tsv @@ 'moods\:mysterious'::tsquery;
  2. name | tags
  3. ------------------------------------------+----------------------------------------------------------
  4. The Name of the Wind by Patrick Rothfuss | ["moods:adventurous", "moods:mysterious", "pace:medium"]
  5. Hyperion by Dan Simmons | ["moods:mysterious", "moods:adventurous", "pace:medium"]

Constructing a facet

Let’s start by defining a facet we are expecting to get in the end:

AttrValueBook count
moodsadventurous5
moodschallenging2
moodsemotional1
moodsmysterious2
pacefast1
pacemedium4

We could easily achieve that result with the following query:

  1. WITH tags AS (
  2. SELECT jsonb_array_elements_text(tags) AS tag
  3. FROM books
  4. )
  5. SELECT
  6. split_part(tag, ':', 1) AS attr,
  7. split_part(tag, ':', 2) AS value,
  8. count(*) AS count
  9. FROM tags
  10. GROUP by attr, value
  11. ORDER BY attr, value, count DESC;
  12. attr | value | count
  13. -------+-------------+-------
  14. moods | adventurous | 5
  15. moods | challenging | 2
  16. moods | emotional | 1
  17. moods | mysterious | 2
  18. pace | fast | 1
  19. pace | medium | 4

But it is rather slow and inefficient because we need to select all tags to build the facet. Can we do better? Yes, using ts_stat function to get the required data directly from the tsv column.

Retrieving document stats

The function ts_stat allows to retrieve document statistics that are maitained by PostgreSQL full text search engine in tsvector columns.

  1. SELECT word, ndoc FROM ts_stat($$ SELECT tsv FROM books $$) ORDER BY word;
  2. word | ndoc
  3. -------------------+------
  4. moods:adventurous | 5
  5. moods:challenging | 2
  6. moods:emotional | 1
  7. moods:mysterious | 2
  8. pace:fast | 1
  9. pace:medium | 4

As you can see, PostgreSQL already maintains the stats we need to build the facet only using the tsv column:

  1. SELECT
  2. split_part(word, ':', 1) AS attr,
  3. split_part(word, ':', 2) AS value,
  4. ndoc AS count
  5. FROM ts_stat($$ SELECT tsv FROM books $$)
  6. ORDER BY word;
  7. attr | value | count
  8. -------+-------------+-------
  9. moods | adventurous | 5
  10. moods | challenging | 2
  11. moods | emotional | 1
  12. moods | mysterious | 2
  13. pace | fast | 1
  14. pace | medium | 4

To build a refined facet, you can use a fast filter over the same tsv column that is covered by the index we created earlier:

  1. SELECT
  2. split_part(word, ':', 1) AS attr,
  3. split_part(word, ':', 2) AS value,
  4. ndoc AS count
  5. FROM ts_stat($$
  6. SELECT tsv FROM books
  7. WHERE tsv @@ 'pace\:fast'::tsquery
  8. $$)
  9. ORDER BY word;
  10. attr | value | count
  11. -------+-------------+-------
  12. moods | adventurous | 1
  13. moods | emotional | 1
  14. pace | fast | 1

Conclusion

PostgreSQL provides everything you need to build fast faceted search for datasets up to 1 million rows. With larger datasets the processing time becomes an issue and you may need to shard your database.

You can also check pg-faceted-searchFaceted navigation and search - 图4open in new window example that demonstrates how to implement faceted search using Go and Bun database client.