Text Search on PostgreSQL
PostgreSQL has out of box support for text search.
Assume we have a table of documents:
CREATE TABLE documents
(
id serial NOT NULL,
doc text
)
INSERT INTO documents(doc)
VALUES ("Lorem ipsum .....");
INSERT INTO documents(doc)
VALUES ("Quick brown fox .....");
------------------------------------
id | doc
------------------------------------
0 | "Lorem ipsum ....."
1 | "Quick brown fox ..."
A simple text search is a basic requirement in any system. This can be done using tsvector
and tsquery
types in PostgreSQL.
[tsvector](www.postgresql.org/docs/9.1/static/datatype-textsearch.html) gives us the list of lexemes for any given text.
[tsquery](www.postgresql.org/docs/9.1/static/datatype-textsearch.html) helps facilitate the search by creating lexemes for search terms, combine search terms / lexemes and compare with tsvector for result.
The to_tsvector
method processes text by removing stop words, stem and normalize words so that they can be used with different variants of the word.
For example, precision would become precis and running would become run
On every insert of a document, we need to get the normalized text of the document and add it to the normalized text column. For this we need to create a new column of type tsvector.
ALTER TABLE documents ADD COLUMN tsv TYPE tsvector;
Next, we need to create a trigger function that will update the tsv column on every insert
CREATE TRIGGER tsvupdate
BEFORE INSERT OR UPDATE
ON documents
FOR EACH ROW
EXECUTE PROCEDURE tsvector_update_trigger(tsv, 'pg.catalog.english', doc);
tsvector_update_trigger()
is built-in method which takes arguments -
- column to store the normalized text
- language of the text (because removing stop words and stemming is specific to a language)
- column_to_read_text_from
- column_to_read_text_from (takes multiple columns as input)
With data populated inside the documents table, we can perform a simple text search using the query:
WITH q AS (SELECT to_tsquery('brown:*') AS query)
SELECT id, doc, tsv from documents, q where q.query @@ documents.tsv;
The to_tsquery
function will convert the input text to tsquery type which can be used to do logical operations - & (AND), | (OR), ! (NOT)
, with lexemes and perform prefix matching using ":*"
The @@
operator checks if the tsvector matches tsquery
So the above query would return us documents which contain “brown”
Limitation: #
tsvector
and tsquery
will only help us find words from a given text but not substring matching.
For substring matching we will have to use the pg_trgm
extension (Trigram based text search). The pg_trgm extension can perform LIKE
operation on text fields.