Custom Ordering of Tag Search Results using Trigrams, ActiveRecord and Postgres

Posted

I am getting the hang of creating these concise videos and felt like zipping through another topic.

This is a real-world example of how tag search is implemented in EveryPost, which I extrapolated to show how to order results when using fuzzy search combined with other ordering logic.

I’m a big believer in keeping things simple, which is why I restrict this example to using the ActsAsTaggableOn gem coupled with the built-in trigram extension in Postgres. You could solve this problem in more complex ways but both of these items already perform much of the heavy lifting needed to make a simple search like this very efficient.

Here I demonstrate how to implement ordering in straight SQL, and then implement the same solution as an ActiveRecord scope:

IMPORTANT!

I incorrectly stated that Arel.sql() prevents SQL injection. That is not true and is evident in the actual error message shown in the console. See the following instead:

# Quoting the input..
quoted_name = connection.quote(name)
Arel.sql("name ILIKE '#{quoted_name}%'")

# ..or sanitizing the query
sorted_name_sql = Arel.sql(sanitize_sql_like(["name ILIKE ?", "#{name}%"]))
sorted_rank_sql = Arel.sql(sanitize_sql_array(["similarity(name, ?)", name]))

To sanitize user-provided values, you will want to use ActiveRecord::ConnectionAdapters::Quoting to process the input before it is used directly in SQL, or else ActiveRecord::Sanitization to sanitize the clause (preferable in this case).

(Link to video)

There are many StackOverflow answers on using Trigrams to rank search results, so I don’t go into that above. Suffice it to say installing the pg_trgrm extension is straightforward and the docs on query syntax are very clear (see above). Knowing what to do with the results, however, is a bit more complicated.

Here are a few other things I left out of the video in the interest of time.

Limits, Tenancy & De-duplication

I didn’t get into how the similar scope is used. Here is the code showing how it is used in EveryPost:

def tag_list(search_term = "", limit = 8)
  conditions = ActsAsTaggableOn::Tag.for_tenant(account)
  conditions = search_term.present? ? conditions.similar(search_term) : conditions.order("name")
  conditions.limit(limit).uniq.pluck("name")
end

Dissecting the above code bit:

  1. Tags are scoped to an ‘account’ (e.g. tenancy)
  2. The “similar” scope only makes sense in the presence of a search term.
  3. Limits results before returning an array of de-duped name fields.

For a large enough set of tags, it makes sense to set up a GIN index for this similarity lookup as follows:

class AddTrigramIndexToTags < ActiveRecord::Migration
  def change
    add_index :tags, :name, using: :gin, opclass: {title: :gin_trgm_ops}, name: :index_on_tags_name_trigram
  end
end

In practice (run locally), I needed 4-5k tag entries before the query optimizer substituted the index for a table scan. YMMV.

Similarity Threshold

Finally, one of the things I left out was the magic behind similarity matches in pg_trgrm. Kidding! Not that much magic. It’s pretty straightforward.

The documentation states that the threshold for results to be considered ‘similar’ is “0.3”. Any search result with a similarity score < 0.3 will be omitted from all result sets.

Like other settings, this can be tweaked globally or per session. I opt not to change this globally and use this instead:

ActiveRecord::Base.connection.execute('SET pg_trgm.similarity_threshold = "0.23"')

I’m still fine-tuning this for tag search, but empirically, using a lower value provides a better match. Put another way, tag names tend to be short. When typing a short word, a single mistyped letter can mean the difference between receiving no results and correct results.