I'm working on a program, and I need to do case-insensitive searches
on text with international characters, like:
ñáéíóúàèìòùäëïöü and so on.
Anyway, I found a way of implementing it, but I don't quite like it
because it would mean rewriting the autocomplete function for *each*
autocomplete I have in my project.
The way I found is to change the condition from:
LOWER(column) like '%thing_downcased%'
to
column ~* 'thing_downcased'
and replacing each international character with a [ñÑ]-style bracket
expression, like this:
name ~* 'la [ñÑ]apa'
and it actually works (at least with PostgreSQL), but then I would
need to do the substitution every time I do a search, and I would need
to reimplement the autocomplete function for each autocompletion with
the new scheme.
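For what it's worth, the substitution itself could be done once in a small helper rather than by hand for every search. This is only a sketch: bracket_non_ascii is a made-up name, it assumes Ruby 2.4 or later (where String#upcase and String#downcase handle non-ASCII letters), and it escapes a conservative set of regex metacharacters that I am assuming is adequate for the database's regex dialect.

```ruby
# Hypothetical helper (not part of Rails or any plugin): rewrites each
# non-ASCII letter in +term+ as a bracket expression holding its lowercase
# and uppercase forms, so "ñ" becomes "[ñÑ]". ASCII characters are kept,
# with regex metacharacters backslash-escaped.
# Assumes Ruby 2.4+ for Unicode-aware upcase/downcase.
REGEX_META = /[.\[\]{}()*+?^$|\\]/

def bracket_non_ascii(term)
  term.each_char.map do |ch|
    if ch.ascii_only?
      ch.gsub(REGEX_META) { |m| "\\#{m}" }
    else
      lower = ch.downcase
      upper = ch.upcase
      # Only bracket letters with a distinct single-character uppercase form.
      if lower != upper && upper.length == 1
        "[#{lower}#{upper}]"
      else
        ch
      end
    end
  end.join
end

pattern = bracket_non_ascii('la ñapa')
# pattern can then be passed as a bind parameter to: name ~* ?
```

Each autocomplete would still need to switch its condition to ~*, but the per-search substitution would live in one place.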
Just to share a different approach: since you can't expect users to type accented words correctly, I usually store an extra normalized column (say name_normalized) for searches, maintained in some Rails way such as filters, or store just the normalized form in Ferret. Then every query has to be normalized as well.
-- fxn
# Utility method that returns an ASCIIfied, downcased, and sanitized string.
# It relies on the Unicode Hacks plugin by means of String#chars. We assume
# $KCODE is 'u' in environment.rb. By now we support a wide range of latin
# accented letters, based on the Unicode Character Palette bundled in Macs.
def self.normalize(str)
  n = str.chars.downcase.strip.to_s
  n.gsub!(/[àáâãäåāąă]/u, 'a')
  n.gsub!(/\s+/, ' ')
  n.gsub!(/æ/u, 'ae')
  n.gsub!(/[ďđ]/u, 'd')
  n.gsub!(/[çćčĉċ]/u, 'c')
  n.gsub!(/[èéêëēęěĕė]/u, 'e')
  n.gsub!(/ƒ/u, 'f')
  n.gsub!(/[ĝğġģ]/u, 'g')
  n.gsub!(/[ĥħ]/, 'h')
  n.gsub!(/[ììíîïīĩĭ]/u, 'i')
  n.gsub!(/[įıijĵ]/u, 'j')
  n.gsub!(/[ķĸ]/u, 'k')
  n.gsub!(/[łľĺļŀ]/u, 'l')
  n.gsub!(/[ñńňņʼnŋ]/u, 'n')
  n.gsub!(/[òóôõöøōőŏŏ]/u, 'o')
  n.gsub!(/œ/u, 'oe')
  n.gsub!(/[ŕřŗ]/u, 'r')
  n.gsub!(/[śšşŝș]/u, 's')
  n.gsub!(/[ťţŧț]/u, 't')
  n.gsub!(/[ùúûüūůűŭũų]/u, 'u')
  n.gsub!(/ŵ/u, 'w')
  n.gsub!(/[ýÿŷ]/u, 'y')
  n.gsub!(/[žżź]/u, 'z')
  n.gsub!(/[^\sa-z0-9_-]/, '')
  n
end
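To make the extra-column idea concrete, here is a rough plain-Ruby sketch of the flow: normalize once when a record is saved, normalize the query the same way, and compare only normalized values. Everything in it is illustrative. Person, save_person, search and simple_normalize are made-up names, the array stands in for a table, and simple_normalize is a simplified stand-in for the normalizer above that uses String#unicode_normalize from modern Ruby (2.4 or later assumed) instead of the Unicode Hacks plugin. In a Rails app the save step would presumably live in a callback on the model.

```ruby
# Illustrative only: an in-memory stand-in for a table with a
# name_normalized column. None of these names come from Rails.
Person = Struct.new(:name, :name_normalized)

# Simplified stand-in normalizer (assumes Ruby 2.4+): decompose accented
# letters, drop the combining marks, downcase, and squeeze whitespace.
def simple_normalize(str)
  str.unicode_normalize(:nfd)
     .gsub(/\p{Mn}/, '')
     .downcase
     .strip
     .gsub(/\s+/, ' ')
end

# "Before save": compute the normalized value once and store it.
def save_person(store, name)
  store << Person.new(name, simple_normalize(name))
end

# Searching normalizes the query too, then matches on the stored column.
def search(store, query)
  q = simple_normalize(query)
  store.select { |person| person.name_normalized.include?(q) }
end

store = []
save_person(store, 'La Ñapa')
save_person(store, 'José')
matches = search(store, 'napa').map(&:name)
```

The point of the design is that the search condition stays a plain LIKE on name_normalized, so the existing autocomplete code only needs to normalize its input.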
Sweet! I've just been looking for a character conversion chart like
this to add a filter to Ferret. In a future version of Ferret (coming
very soon) this will be a lot easier and faster. I'll probably put an
option on the StandardAnalyzer called :normalize_unicode or something.
I noticed in the mail that a q-like character was in the a-like character class. I moved it out and am sending the normalizer again for the archives:
# Utility method that returns an ASCIIfied, downcased, and sanitized string.
# It relies on the Unicode Hacks plugin by means of String#chars. We assume
# $KCODE is 'u' in environment.rb. By now we support a wide range of latin
# accented letters, based on the Unicode Character Palette bundled in Macs.
def self.normalize(str)
  n = str.chars.downcase.strip.to_s
  n.gsub!(/[àáâãäåāă]/u, 'a')
  n.gsub!(/æ/u, 'ae')
  n.gsub!(/[ďđ]/u, 'd')
  n.gsub!(/[çćčĉċ]/u, 'c')
  n.gsub!(/[èéêëēęěĕė]/u, 'e')
  n.gsub!(/ƒ/u, 'f')
  n.gsub!(/[ĝğġģ]/u, 'g')
  n.gsub!(/[ĥħ]/, 'h')
  n.gsub!(/[ììíîïīĩĭ]/u, 'i')
  n.gsub!(/[įıijĵ]/u, 'j')
  n.gsub!(/[ķĸ]/u, 'k')
  n.gsub!(/[łľĺļŀ]/u, 'l')
  n.gsub!(/[ñńňņʼnŋ]/u, 'n')
  n.gsub!(/[òóôõöøōőŏŏ]/u, 'o')
  n.gsub!(/œ/u, 'oe')
  n.gsub!(/ą/u, 'q')
  n.gsub!(/[ŕřŗ]/u, 'r')
  n.gsub!(/[śšşŝș]/u, 's')
  n.gsub!(/[ťţŧț]/u, 't')
  n.gsub!(/[ùúûüūůűŭũų]/u, 'u')
  n.gsub!(/ŵ/u, 'w')
  n.gsub!(/[ýÿŷ]/u, 'y')
  n.gsub!(/[žżź]/u, 'z')
  n.gsub!(/\s+/, ' ')
  n.gsub!(/[^\sa-z0-9_-]/, '')
  n
end
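For the archives as well, the same chart can be expressed as a lookup table applied in a single pass, which may be easier to extend than a chain of gsub! calls. This is a sketch under assumptions, not a drop-in replacement: it assumes Ruby 2.4 or later (so String#downcase handles accented letters without the Unicode Hacks plugin), it copies only a subset of the groups above, and GROUPS, TRANSLIT, ACCENTED and normalize_one_pass are names made up for the example.

```ruby
# ASCII replacement => accented letters it stands for. Only a subset of
# the chart above is copied here; extend it as needed.
GROUPS = {
  'a'  => 'àáâãäåāă',
  'ae' => 'æ',
  'c'  => 'çćčĉċ',
  'e'  => 'èéêëēęěĕė',
  'n'  => 'ñńňņŋ',
  'o'  => 'òóôõöøōőŏ',
  'oe' => 'œ',
  'u'  => 'ùúûüūůűŭũų'
}.freeze

# Invert the groups into a per-character table: 'ñ' => 'n', 'æ' => 'ae', ...
TRANSLIT = GROUPS.each_with_object({}) do |(ascii, accented), map|
  accented.each_char { |ch| map[ch] = ascii }
end.freeze

# One regexp matching any character that has an entry in the table.
ACCENTED = Regexp.union(TRANSLIT.keys)

def normalize_one_pass(str)
  n = str.downcase.strip
  n = n.gsub(ACCENTED, TRANSLIT)   # gsub with a Hash looks up each match
  n = n.gsub(/\s+/, ' ')
  n.gsub(/[^\sa-z0-9_-]/, '')      # drop anything still outside the safe set
end

key = normalize_one_pass('  La   ÑAPA ')
```

Characters with no entry in the table are removed by the final gsub, same as in the original.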