Dealing with accented characters

11175 · September 29, 2008, 8:26pm

Is there any way to write a search function that can search for words that contain accented characters when the user types in words without accented characters?

My database has a lot of names in it that have characters with accents and other non-keyboard characters. When users search the database, I would like for them to be able to find records with accented characters even if they don't type in the accent. For instance, a user might be searching for a text by the author Chrétien de Troyes. Right now, they have to type "Chrétien" into the search form to find him: I would like for a search for "Chretien" to also find "Chrétien."

This strikes me as a rather common problem: is there a good solution for it?

Frederick_Cheung · September 29, 2008, 8:32pm

> Is there any way to write a search function that can search for words that contain accented characters when the user types in words without accented characters?

Play with your database collation settings

Fred

Jeffrey · October 1, 2008, 4:39pm

Quoting Frederick Cheung <frederick.cheung@gmail.com>:

> > Is there any way to write a search function that can search for words that contain accented characters when the user types in words without accented characters?
>
> Play with your database collation settings
>
> Fred

This works if you have only one language. With multiple languages, you need to keep the locale, switching as needed. Generic Latin1 may do what you need.

Jeffrey

Petite_Abeille · October 1, 2008, 7:01pm

One approach is to transliterate your input, e.g.:

http://interglacial.com/~sburke/tpj/as_html/tpj22.html -- Sean M. Burke, Unidecode!, 2001

That way, "Chrétien" becomes "chretien" or some such for the purpose of your search, but remains "Chrétien" in the text.

For example, both El-Aaiún and El-Aaiun could reference the same underlying text:

http://svr225.stepx.com:3388/El-Aaiún
http://svr225.stepx.com:3388/El-Aaiun
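For Latin-script names, a rough way to get that "chretien" form in Ruby is to decompose the string and drop the combining marks. This is a minimal sketch, not Unidecode itself: it relies on Ruby's built-in String#unicode_normalize, and the method name fold_accents is only an illustration.

```ruby
# Sketch only: strips accents by Unicode decomposition. This is NOT Unidecode;
# fold_accents is a hypothetical helper name, not part of any library.
def fold_accents(text)
  text
    .unicode_normalize(:nfd)   # split "é" into "e" + combining acute accent
    .gsub(/\p{Mn}/, '')        # remove the combining (non-spacing) marks
    .downcase
end

puts fold_accents('Chrétien')   # prints "chretien"
puts fold_accents('El-Aaiún')   # prints "el-aaiun"
```

Letters that have no decomposition (for example "ß" or "ø") pass through unchanged, so this only covers the simple accent case.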

Cheers,

11175 · October 2, 2008, 12:25am

This looks really promising, but after reading up on this for a while, I don't see how to get it to work with Rails... could you give me a few pointers or direct me to some documentation?

Thank you!!

Petite_Abeille · October 2, 2008, 5:56pm

> This looks really promising, but after reading up on this for a while, I don't see how to get it to work with Rails... could you give me a few pointers or direct me to some documentation?

At its core, Unidecode is simply a lookup table. Should be rather straightforward to port to Ruby if it hasn't been done already.

Here is the original Perl implementation:

And below is a Lua port of it:

http://dev.alt.textdrive.com/browser/HTTP/Unidecode.lua

As well as the lookup tables themselves:

http://dev.alt.textdrive.com/browser/HTTP/Unidecode

Usage example:

local Unidecode = require( 'Unidecode' )

print( 1, 'Москва́', Unidecode( 'Москва́' ) )
print( 2, '北京', Unidecode( '北京' ) )
print( 3, 'Ἀθηνᾶ', Unidecode( 'Ἀθηνᾶ' ) )
print( 4, '서울', Unidecode( '서울' ) )
print( 5, '東京', Unidecode( '東京' ) )
print( 6, '京都市', Unidecode( '京都市' ) )
print( 7, 'नेपाल', Unidecode( 'नेपाल' ) )

> 1 Москва́ Moskva
> 2 北京 beijing
> 3 Ἀθηνᾶ Athena
> 4 서울 seoul
> 5 東京 dongjing
> 6 京都市 jingdushi
> 7 नेपाल nepaal
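To give an idea of what "simply a lookup table" could look like on the Ruby side, here is a toy sketch. The table below is a tiny hand-written sample invented for illustration (the real Unidecode tables cover thousands of characters), and the names ASCII_TABLE and table_transliterate are hypothetical, not taken from any existing port.

```ruby
# Toy lookup table: a handful of sample entries only, invented for this sketch.
ASCII_TABLE = {
  "\u00E9" => 'e',   # é
  "\u00FA" => 'u',   # ú
  "\u00DF" => 'ss',  # ß
  "\u00E6" => 'ae',  # æ
  "\u00F8" => 'o'    # ø
}.freeze

# Replace every non-ASCII character with its table entry, or '?' if unknown.
def table_transliterate(text)
  text
    .unicode_normalize(:nfc)   # compose "e" + accent into one character first
    .gsub(/[^[:ascii:]]/) { |char| ASCII_TABLE.fetch(char, '?') }
end

puts table_transliterate("Chr\u00E9tien")   # prints "Chretien"
puts table_transliterate("Stra\u00DFe")     # prints "Strasse"
```

Unlike accent stripping, a table can map one character to several ("ß" to "ss"), which is the part that makes the Unidecode approach work across scripts.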

If Unidecode is too much of a good thing, one could use iconv translit or such, e.g. iconv( 'utf-8', 'us-ascii//TRANSLIT' )...

One way or another, the crux of it is to transliterate your data as well as your query, and then use the latter to search the former.
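In Ruby terms, that could look roughly like the sketch below. Everything in it is an assumption made for illustration: the AUTHORS array stands in for a database table, search_key is a hypothetical helper that uses plain accent stripping rather than Unidecode, and in a Rails app the precomputed key would more likely live in an extra column than in an in-memory array.

```ruby
# Hypothetical stand-in for a table of author names.
AUTHORS = ['Chrétien de Troyes', 'Marie de France', 'François Villon'].freeze

# Hypothetical helper: same transformation for stored data and for queries.
def search_key(text)
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, '').downcase
end

# Transliterate the data once, keeping the original text next to its key.
AUTHOR_INDEX = AUTHORS.map { |name| [search_key(name), name] }.freeze

# Transliterate the query, match on keys, return the original names.
def search_authors(query)
  key = search_key(query)
  AUTHOR_INDEX.select { |stored_key, _| stored_key.include?(key) }
              .map { |_, name| name }
end

p search_authors('Chretien')   # finds the accented record
p search_authors('CHRÉTIEN')   # so does an accented, upper-case query
```

The displayed text is never altered; only the key used for matching is.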

Cheers,

Jens_Wille · October 2, 2008, 5:59pm

Petite Abeille [2008-10-02 19:56]:

> At its core, Unidecode is simply a lookup table. Should be rather straightforward to port to Ruby if it hasn't been done already.

i wanted to do it, but it's been there for over a year now:

<http://rubyforge.org/projects/unidecode>

cheers jens
