martindonadieu@pg_bm25
Overview
pg_bm25
is a PostgreSQL extension that enables full text search over SQL tables using the BM25 algorithm, the state-of-the-art ranking function for full text search. It is built on top of Tantivy, the Rust-based alternative to Apache Lucene, using pgrx
.
pg_bm25
is supported on all versions supported by the PostgreSQL Global Development Group, which includes PostgreSQL 12+.
Check out the pg_bm25
benchmarks here.
Roadmap
- BM25 scoring
- Highlighting
- Boosted queries
- Filtering
- Bucket and metrics aggregations
- Autocomplete
- Fuzzy search
- Custom tokenizers
- JSON field search
- Datetime aggregations
- Facet fields
Installation
From ParadeDB
The easiest way to use the extension is to run the ParadeDB Dockerfile:
docker run \
-e POSTGRES_USER=<user> \
-e POSTGRES_PASSWORD=<password> \
-e POSTGRES_DB=<dbname> \
-p 5432:5432 \
-d \
paradedb/paradedb:latest
This will spin up a Postgres instance with pg_bm25
preinstalled.
From Self-Hosted PostgreSQL
If you are self-hosting Postgres and would like to use the extension within your existing Postgres, follow these steps:
Debian/Ubuntu
We provide pre-built binaries for Debian-based Linux for PostgreSQL 15 (more versions coming soon). You can download the latest version for your architecture from the releases page.
ParadeDB collects anonymous telemetry to help us understand how many people are using the project. You can opt-out of telemetry by setting export TELEMETRY=false
(or unsetting the variable) in your shell or in your ~/.bashrc
file before running the extension.
macOS and Windows
We don't suggest running production workloads on macOS or Windows. As a result, we don't provide prebuilt binaries for these platforms. If you are running Postgres on macOS or Windows and want to install pg_bm25
, please follow the development instructions, but do cargo pgrx install --release
instead of cargo pgrx run
. This will build the extension from source and install it in your Postgres instance.
You can then create the extension in your database by running:
CREATE EXTENSION pg_bm25;
Note: If you are using a managed Postgres service like Amazon RDS, you will not be able to install pg_bm25
until the Postgres service explicitly supports it.
Usage
Indexing
pg_bm25
comes with a helper function that creates a test table that you can use for quick experimentation.
CALL paradedb.create_bm25_test_table();
CREATE TABLE mock_items AS SELECT * FROM paradedb.bm25_test_table;
To index the table, use the following SQL command:
CREATE INDEX idx_mock_items
ON mock_items
USING bm25 ((mock_items.*))
WITH (key_field='id', text_fields='{"description": {}, "category": {}}');
Note the mandatory key_field
option in the WITH
code. Every bm25
index needs a key_field
, which should be the name of a column that will function as a row's unique identifier within the index. Usually, the key_field
can just be the name of your table's primary key column.
Once the indexing is complete, you can run various search functions on it.
Basic Search
Execute a search query on your indexed table:
SELECT description, rating, category
FROM mock_items
WHERE mock_items @@@ 'description:keyboard OR category:electronics'
LIMIT 5;
This will return:
description | rating | category
-----------------------------+--------+------------- Plastic Keyboard | 4 | Electronics Ergonomic metal keyboard | 4 | Electronics Innovative wireless earbuds | 5 | Electronics Fast charging power bank | 4 | Electronics Bluetooth-enabled speaker | 3 | Electronics (5 rows)
Scoring and highlighting are supported:
SELECT description, rating, category, paradedb.rank_bm25(id), paradedb.highlight_bm25(id, 'idx_mock_items', 'description')
FROM mock_items
WHERE mock_items @@@ 'description:keyboard OR category:electronics'
LIMIT 5;
This will return:
id | description | rating | category | rank_bm25 | highlight_bm25 ----+-----------------------------+--------+-------------+-----------+--------------------------------- 1 | Ergonomic metal keyboard | 4 | Electronics | 4.9403534 | Ergonomic metal <b>keyboard</b> 2 | Very plasticy keyboard | 4 | Electronics | 4.9403534 | Very plasticy <b>keyboard</b> 12 | Innovative wireless earbuds | 5 | Electronics | 2.1096356 | 22 | Fast charging power bank | 4 | Electronics | 2.1096356 | 32 | Bluetooth-enabled speaker | 3 | Electronics | 2.1096356 | (5 rows)
Scores can be tuned via boosted queries:
SELECT description, rating, category
FROM mock_items
WHERE mock_items @@@ 'description:keyboard^2 OR category:electronics';
New data that arrives or rows that are changed are automatically reindexed and searchable. For instance, let's create and search for a new row in our table:
INSERT INTO mock_items (description, rating, category) VALUES ('New keyboard', 5, 'Electronics');
SELECT description, rating, category
FROM mock_items
WHERE mock_items @@@ 'description:keyboard OR category:electronics'
LIMIT 5;
This will return:
description | rating | category
-----------------------------+--------+------------- New keyboard | 5 | Electronics Plastic Keyboard | 4 | Electronics Ergonomic metal keyboard | 4 | Electronics Innovative wireless earbuds | 5 | Electronics Fast charging power bank | 4 | Electronics (5 rows)
Please refer to the documentation for a more thorough overview of pg_bm25
's query support.
Development
Prerequisites
Before developing the extension, ensure that you have Rust installed
(version >1.70), ideally via rustup
(we've observed issues with installing Rust via Homebrew on macOS).
If you are on macOS and using Postgres.app, you'll first need to add the pg_config
binary to your path:
export PATH="$PATH:/Applications/Postgres.app/Contents/Versions/latest/bin"
Then, install and initialize pgrx:
# Note: Replace --pg15 with your version of Postgres, if different (i.e. --pg16, --pg14, etc.)
cargo install --locked cargo-pgrx --version 0.11.1
cargo pgrx init --pg15=`which pg_config`
Running the Extension
First, start pgrx:
cargo pgrx run
This will launch an interactive connection to Postgres. Inside Postgres, create the extension by running:
CREATE EXTENSION pg_bm25;
Now, you have access to all the extension functions.
Modifying the Extension
If you make changes to the extension code, follow these steps to update it:
- Recompile the extension:
cargo pgrx run
- Recreate the extension to load the latest changes:
DROP EXTENSION pg_bm25;
CREATE EXTENSION pg_bm25;
Testing
To run the unit test suite, use the following command:
cargo pgrx test
This will run all unit tests defined in /src
. To add a new unit test, simply add tests inline in the relevant files, using the #[cfg(test)]
attribute.
To run the integration test suite, simply run:
./test/runtests.sh -p threaded
This will create a temporary database, initialize it with the SQL commands defined in fixtures.sql
, and run the tests in /test/sql
against it. To add a new test, simply add a new .sql
file to /test/sql
and a corresponding .out
file to /test/expected
for the expected output, and it will automatically get picked up by the test suite.
Note: the bash script takes arguments and allows you to run tests either sequentially or in parallel. For more info run ./test/runtests.sh -h
License
pg_bm25
is licensed under the GNU Affero General Public License v3.0.
Install
- Install the
dbdev
package manager - Install the package:
select dbdev.install('martindonadieu@pg_bm25');
create extension "martindonadieu@pg_bm25"
version '0.3.3';
Downloads
- 8 all time downloads
- 0 downloads in the last 30 days
- 0 downloads in the last 90 days
- 3 downloads in the last 180 days