May 31, 2014

Text Search on PostgreSQL

PostgreSQL has out of box support for text search.

Assume we have a table of documents:

CREATE TABLE documents
(
  id serial NOT NULL,
  doc text
)

INSERT INTO documents(doc)
VALUES ("Lorem ipsum .....");

INSERT INTO documents(doc)
VALUES ("Quick brown fox .....");

------------------------------------
id       | doc
------------------------------------
0        | "Lorem ipsum ....."
1        | "Quick brown fox ..."

A simple text search is a basic requirement in any system. This can be done using tsvector and tsquery types in PostgreSQL.

[tsvector](www.postgresql.org/docs/9.1/static/datatype-textsearch.html) gives us the list of lexemes for any given text.

[tsquery](www.postgresql.org/docs/9.1/static/datatype-textsearch.html) helps facilitate the search by creating lexemes for search terms, combine search terms / lexemes and compare with tsvector for result.

The to_tsvector method processes text by removing stop words, stem and normalize words so that they can be used with different variants of the word.

For example, precision would become precis and running would become run

On every insert of a document, we need to get the normalized text of the document and add it to the normalized text column. For this we need to create a new column of type tsvector.

ALTER TABLE documents ADD COLUMN tsv TYPE tsvector;

Next, we need to create a trigger function that will update the tsv column on every insert

CREATE TRIGGER tsvupdate
BEFORE INSERT OR UPDATE
ON documents
FOR EACH ROW
EXECUTE PROCEDURE tsvector_update_trigger(tsv, 'pg.catalog.english', doc);

tsvector_update_trigger() is built-in method which takes arguments -

column to store the normalized text
language of the text (because removing stop words and stemming is specific to a language)
column_to_read_text_from
column_to_read_text_from (takes multiple columns as input)

With data populated inside the documents table, we can perform a simple text search using the query:

WITH q AS (SELECT to_tsquery('brown:*') AS query)
SELECT id, doc, tsv from documents, q where q.query @@ documents.tsv;

The to_tsquery function will convert the input text to tsquery type which can be used to do logical operations - & (AND), | (OR), ! (NOT), with lexemes and perform prefix matching using ":*"

The @@ operator checks if the tsvector matches tsquery

So the above query would return us documents which contain “brown”

example-image

Limitation: #

tsvector and tsquery will only help us find words from a given text but not substring matching.
For substring matching we will have to use the pg_trgm extension (Trigram based text search). The pg_trgm extension can perform LIKE operation on text fields.

Kudos

Text Search on PostgreSQL

Limitation: #

Now read this

Machine Learning