Custom Ordering of Tag Search Results using Trigrams, ActiveRecord and Postgres
PostedI am getting the hang of creating these concise videos and felt like zipping through another topic.
This is a real-world example of how tag search is implemented in EveryPost, which I extrapolated to show how to order results when using fuzzy search combined with other ordering logic.
I’m a big believer in keeping things simple, which is why I restrict this example to using the ActsAsTaggableOn gem coupled with the built-in trigram extension in Postgres. You could solve this problem in more complex ways but both of these items already perform much of the heavy lifting needed to make a simple search like this very efficient.
Here I demonstrate how to implement ordering in straight SQL, and then implement the same solution as an ActiveRecord scope:
IMPORTANT!
I incorrectly stated that
Arel.sql()
prevents SQL injection. That is not true and is evident in the actual error message shown in the console. See the following instead:# Quoting the input.. quoted_name = connection.quote(name) Arel.sql("name ILIKE '#{quoted_name}%'") # ..or sanitizing the query sorted_name_sql = Arel.sql(sanitize_sql_like(["name ILIKE ?", "#{name}%"])) sorted_rank_sql = Arel.sql(sanitize_sql_array(["similarity(name, ?)", name]))
To sanitize user-provided values, you will want to use ActiveRecord::ConnectionAdapters::Quoting to process the input before it is used directly in SQL, or else ActiveRecord::Sanitization to sanitize the clause (preferable in this case).
There are many StackOverflow answers on using Trigrams to rank search results, so I don’t go into that above. Suffice it to say installing the pg_trgrm
extension is straightforward and the docs on query syntax are very clear (see above). Knowing what to do with the results, however, is a bit more complicated.
Here are a few other things I left out of the video in the interest of time.
Limits, Tenancy & De-duplication
I didn’t get into how the similar
scope is used. Here is the code showing how it is used in EveryPost:
def tag_list(search_term = "", limit = 8)
conditions = ActsAsTaggableOn::Tag.for_tenant(account)
conditions = search_term.present? ? conditions.similar(search_term) : conditions.order("name")
conditions.limit(limit).uniq.pluck("name")
end
Dissecting the above code bit:
- Tags are scoped to an ‘account’ (e.g. tenancy)
- The “similar” scope only makes sense in the presence of a search term.
- Limits results before returning an array of de-duped
name
fields.
Performance of Trigram Search
For a large enough set of tags, it makes sense to set up a GIN index for this similarity lookup as follows:
class AddTrigramIndexToTags < ActiveRecord::Migration
def change
add_index :tags, :name, using: :gin, opclass: {title: :gin_trgm_ops}, name: :index_on_tags_name_trigram
end
end
In practice (run locally), I needed 4-5k tag entries before the query optimizer substituted the index for a table scan. YMMV.
Similarity Threshold
Finally, one of the things I left out was the magic behind similarity matches in pg_trgrm
. Kidding! Not that much magic. It’s pretty straightforward.
The documentation states that the threshold for results to be considered ‘similar’ is “0.3”. Any search result with a similarity score < 0.3 will be omitted from all result sets.
Like other settings, this can be tweaked globally or per session. I opt not to change this globally and use this instead:
ActiveRecord::Base.connection.execute('SET pg_trgm.similarity_threshold = "0.23"')
I’m still fine-tuning this for tag search, but empirically, using a lower value provides a better match. Put another way, tag names tend to be short. When typing a short word, a single mistyped letter can mean the difference between receiving no results and correct results.