International character search.

soulhunter · October 12, 2006, 2:19am

Hi!

I'm working on a program and need to do case-insensitive searches on text containing international characters, like:

ñáéíóúàèìòùäëïöü and so on.

Anyway, I found a way of implementing it, but I don't quite like it because it would mean rewriting the autocomplete function for *each* autocomplete I have in my project.

The approach I found is to change the condition from:

LOWER(column) like '%thing_downcased%'

to

column ~* 'thing_downcased'

and to replace each international character with a [ñÑ]-style bracket expression, like this:

name ~* 'la [ñÑ]apa'

and it actually works (at least with PostgreSQL), but then I would need to do the substitution every time I run a search, and I would need to reimplement the autocomplete function for each autocompletion with the new scheme.
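The substitution step described above can be sketched as a single helper, so it lives in one place instead of in every autocomplete. This is a minimal sketch, not the poster's code: the name case_class_pattern is hypothetical, and it assumes Ruby 2.4 or later, where String#upcase and String#downcase handle non-ASCII letters.

```ruby
# Hypothetical helper (not from the thread): turns a search term into a
# pattern for PostgreSQL's ~* operator by replacing every non-ASCII letter
# with a bracket expression holding its lower- and upper-case forms.
# Assumes Ruby 2.4+ for Unicode-aware upcase/downcase.
def case_class_pattern(term)
  term.each_char.map do |c|
    if c.ascii_only?
      # Keep plain letters, digits and spaces; backslash-escape everything
      # else so regex metacharacters in user input are matched literally.
      c =~ /[A-Za-z0-9 ]/ ? c : "\\#{c}"
    else
      # Only build a class when both case forms are single characters
      # (for example, the upper-case form of ß is two characters long).
      variants = [c.downcase, c.upcase].uniq.select { |v| v.length == 1 }
      variants.length == 2 ? "[#{variants.join}]" : c
    end
  end.join
end

puts case_class_pattern('la ñapa')
```

The resulting string would then be passed to the query as a bound parameter (for instance in a condition such as name ~* ?) rather than interpolated into the SQL.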

Any better ideas?

Sincerely,

Ildefonso Camargo

fxn · October 12, 2006, 9:08pm

Just to share a different approach: since you can't expect users to type accented words correctly, I usually store an extra normalized column (say name_normalized) for searches, maintained the Rails way with filters, or store just the normalized text in Ferret. Any query then has to be normalized the same way before searching.

-- fxn

# Utility method that returns an ASCIIfied, downcased, and sanitized string.
# It relies on the Unicode Hacks plugin by means of String#chars. We assume
# $KCODE is 'u' in environment.rb. So far we support a wide range of Latin
# accented letters, based on the Unicode Character Palette bundled in Macs.
def self.normalize(str)
  n = str.chars.downcase.strip.to_s
  n.gsub!(/[àáâãäåāąă]/u, 'a')
  n.gsub!(/\s+/, ' ')
  n.gsub!(/æ/u, 'ae')
  n.gsub!(/[ďđ]/u, 'd')
  n.gsub!(/[çćčĉċ]/u, 'c')
  n.gsub!(/[èéêëēęěĕė]/u, 'e')
  n.gsub!(/ƒ/u, 'f')
  n.gsub!(/[ĝğġģ]/u, 'g')
  n.gsub!(/[ĥħ]/, 'h')
  n.gsub!(/[ììíîïīĩĭ]/u, 'i')
  n.gsub!(/[įıĳĵ]/u, 'j')
  n.gsub!(/[ķĸ]/u, 'k')
  n.gsub!(/[łľĺļŀ]/u, 'l')
  n.gsub!(/[ñńňņŉŋ]/u, 'n')
  n.gsub!(/[òóôõöøōőŏŏ]/u, 'o')
  n.gsub!(/œ/u, 'oe')
  n.gsub!(/[ŕřŗ]/u, 'r')
  n.gsub!(/[śšşŝș]/u, 's')
  n.gsub!(/[ťţŧț]/u, 't')
  n.gsub!(/[ùúûüūůűŭũų]/u, 'u')
  n.gsub!(/ŵ/u, 'w')
  n.gsub!(/[ýÿŷ]/u, 'y')
  n.gsub!(/[žżź]/u, 'z')
  n.gsub!(/[^\sa-z0-9_-]/, '')
  n
end
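To illustrate the normalized-column idea end to end, here is a minimal sketch under stated assumptions. It runs without Rails: the Person struct and its save method are hypothetical stand-ins for a model with a name_normalized column kept up to date by a before_save callback, and the normalize function is a cut-down stand-in covering only a handful of letters. It assumes Ruby 2.4 or later.

```ruby
# Cut-down stand-in for a full normalizer, covering only a few letters so
# the sketch runs on its own (assumes Ruby 2.4+ for Unicode-aware downcase).
def normalize(str)
  str.downcase.strip
     .tr('áàäéèëíìïóòöúùüñ', 'aaaeeeiiiooouuun')
     .gsub(/\s+/, ' ')
     .gsub(/[^\sa-z0-9_-]/, '')
end

# Hypothetical in-memory record. In Rails the assignment inside #save would
# sit in a before_save callback and name_normalized would be a real column.
Person = Struct.new(:name, :name_normalized) do
  def save
    self.name_normalized = normalize(name)
    self
  end
end

PEOPLE = ['La Ñapa', 'José  Pérez', 'Zoë'].map { |n| Person.new(n).save }

# The query goes through the same normalizer before the comparison,
# mirroring a condition like: name_normalized LIKE '%...%'
def search(term)
  needle = normalize(term)
  PEOPLE.select { |person| person.name_normalized.include?(needle) }
end

p search('napa').map(&:name)
p search('JOSE perez').map(&:name)
```

Because both the stored value and the query pass through the same function, the search condition itself stays a plain LIKE and no per-autocomplete regex rewriting is needed.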

David_Balmain · October 13, 2006, 12:47am

Sweet! I've just been looking for a character conversion chart like this to add a filter to Ferret. In a future version of Ferret (coming very soon) this will be a lot easier and faster. I'll probably put an option on the StandardAnalyzer called :normalize_unicode or something.

Thanks Xavier,
Dave

fxn · October 13, 2006, 1:57am

Excellent!

I noticed in the mail that a q-like character was in the a-like character class, so I moved it out and am sending the normalizer again for the archives:

# Utility method that returns an ASCIIfied, downcased, and sanitized string.
# It relies on the Unicode Hacks plugin by means of String#chars. We assume
# $KCODE is 'u' in environment.rb. So far we support a wide range of Latin
# accented letters, based on the Unicode Character Palette bundled in Macs.
def self.normalize(str)
  n = str.chars.downcase.strip.to_s
  n.gsub!(/[àáâãäåāă]/u, 'a')
  n.gsub!(/æ/u, 'ae')
  n.gsub!(/[ďđ]/u, 'd')
  n.gsub!(/[çćčĉċ]/u, 'c')
  n.gsub!(/[èéêëēęěĕė]/u, 'e')
  n.gsub!(/ƒ/u, 'f')
  n.gsub!(/[ĝğġģ]/u, 'g')
  n.gsub!(/[ĥħ]/, 'h')
  n.gsub!(/[ììíîïīĩĭ]/u, 'i')
  n.gsub!(/[įıĳĵ]/u, 'j')
  n.gsub!(/[ķĸ]/u, 'k')
  n.gsub!(/[łľĺļŀ]/u, 'l')
  n.gsub!(/[ñńňņŉŋ]/u, 'n')
  n.gsub!(/[òóôõöøōőŏŏ]/u, 'o')
  n.gsub!(/œ/u, 'oe')
  n.gsub!(/ą/u, 'q')
  n.gsub!(/[ŕřŗ]/u, 'r')
  n.gsub!(/[śšşŝș]/u, 's')
  n.gsub!(/[ťţŧț]/u, 't')
  n.gsub!(/[ùúûüūůűŭũų]/u, 'u')
  n.gsub!(/ŵ/u, 'w')
  n.gsub!(/[ýÿŷ]/u, 'y')
  n.gsub!(/[žżź]/u, 'z')
  n.gsub!(/\s+/, ' ')
  n.gsub!(/[^\sa-z0-9_-]/, '')
  n
end

-- fxn
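For readers on a current Ruby, where $KCODE and the Unicode Hacks plugin are no longer available, the same kind of normalizer can be sketched with the built-in String#unicode_normalize. This is a different technique from the hand-written table in the thread: it decomposes each letter and drops the combining marks, so individual letters may come out differently. The names normalize_nfd and NO_DECOMPOSITION are hypothetical, the small table is only an illustrative subset, and Ruby 2.4 or later is assumed.

```ruby
# Letters with no canonical decomposition are not affected by NFD, so they
# still need an explicit table. Illustrative subset only.
NO_DECOMPOSITION = {
  'æ' => 'ae', 'œ' => 'oe', 'ø' => 'o', 'đ' => 'd', 'ł' => 'l', 'ħ' => 'h'
}.freeze

# Hypothetical variant of a search normalizer, assuming Ruby 2.4+.
def normalize_nfd(str)
  n = str.downcase.strip
  # Split accented letters into base letter + combining marks, then drop
  # the marks (Unicode category Mn).
  n = n.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
  # Map the remaining special letters through the explicit table.
  n = n.gsub(/[#{NO_DECOMPOSITION.keys.join}]/) { |c| NO_DECOMPOSITION[c] }
  # Collapse whitespace and strip anything outside the allowed set.
  n.gsub(/\s+/, ' ').gsub(/[^\sa-z0-9_-]/, '')
end

puts normalize_nfd('  Ñandú   Œuvre! ')
puts normalize_nfd('Łódź')
```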
