Text Search on PostgreSQL

PostgreSQL has out of box support for text search.

Assume we have a table of documents:

CREATE TABLE documents
  id serial NOT NULL,
  doc text

INSERT INTO documents(doc)
VALUES ("Lorem ipsum .....");

INSERT INTO documents(doc)
VALUES ("Quick brown fox .....");

id       | doc
0        | "Lorem ipsum ....."
1        | "Quick brown fox ..."

A simple text search is a basic requirement in any system. This can be done using tsvector and tsquery types in PostgreSQL.

[tsvector](www.postgresql.org/docs/9.1/static/datatype-textsearch.html) gives us the list of lexemes for any given text.

[tsquery](www.postgresql.org/docs/9.1/static/datatype-textsearch.html) helps facilitate the search by creating lexemes for search terms, combine search terms / lexemes and compare with tsvector for result.

The to_tsvector method processes text by removing stop words, stem and normalize words so that they can be used with different variants of the word.

For example, precision would become precis and running would become run

On every insert of a document, we need to get the normalized text of the document and add it to the normalized text column. For this we need to create a new column of type tsvector.

ALTER TABLE documents ADD COLUMN tsv TYPE tsvector;

Next, we need to create a trigger function that will update the tsv column on every insert

ON documents
EXECUTE PROCEDURE tsvector_update_trigger(tsv, 'pg.catalog.english', doc);

tsvector_update_trigger() is built-in method which takes arguments -

  1. column to store the normalized text
  2. language of the text (because removing stop words and stemming is specific to a language)
  3. column_to_read_text_from
  4. column_to_read_text_from (takes multiple columns as input)

With data populated inside the documents table, we can perform a simple text search using the query:

WITH q AS (SELECT to_tsquery('brown:*') AS query)
SELECT id, doc, tsv from documents, q where q.query @@ documents.tsv;

The to_tsquery function will convert the input text to tsquery type which can be used to do logical operations - & (AND), | (OR), ! (NOT), with lexemes and perform prefix matching using ":*"

The @@ operator checks if the tsvector matches tsquery

So the above query would return us documents which contain “brown”


Limitation: #

tsvector and tsquery will only help us find words from a given text but not substring matching.
For substring matching we will have to use the pg_trgm extension (Trigram based text search). The pg_trgm extension can perform LIKE operation on text fields.


Now read this

Parsing JSON in Scala

Introduction # I started a side project on Scala with a group of friends (noobs in scala). We chose Scala because it is well known for type safety and functional programming with support for OOP. One of the important parts of the project... Continue →