Dealing with accented characters

Is there any way to write a search function that can search for words
that contain accented characters when the user types in words without
accented characters?

My database has a lot of names in it that have characters with accents
and other non-keyboard characters. When users search the database, I
would like for them to be able to find records with accented characters
even if they don't type in the accent. For instance, a user might be
searching for a text by the author Chrétien de Troyes. Right now, they
have to type "Chrétien" into the search form to find him: I would like
for a search for "Chretien" to also find "Chrétien."

This strikes me as a rather common problem: is there a good solution for
it?

Play with your database collation settings

Fred

Quoting Frederick Cheung <frederick.cheung@gmail.com>:

> Play with your database collation settings

This works if you have only one language. With multiple languages, you need
to keep the locale, switching as needed. Generic Latin1 may do what you need.

Jeffrey

One approach is to transliterate your input, e.g.:

http://interglacial.com/~sburke/tpj/as_html/tpj22.html
-- Sean M. Burke, Unidecode!, 2001

That way, "Chrétien" becomes "chretien" or some such for the purpose of your search, but remains "Chrétien" in the text.
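
For accented Latin letters specifically, a minimal Ruby sketch of this idea might look like the following. It is not Unidecode: it assumes Ruby 2.2 or later and uses the standard `String#unicode_normalize` to decompose each character, then drops the combining marks. The helper name `fold_accents` is made up for illustration.

```ruby
# Fold accented letters to their base letters, for search purposes only.
# Assumption: Unicode decomposition stands in for a full transliteration table.
def fold_accents(text)
  # NFD splits "é" into "e" followed by U+0301 (combining acute accent).
  decomposed = text.unicode_normalize(:nfd)
  # \p{Mn} matches nonspacing combining marks; remove them, then lowercase.
  decomposed.gsub(/\p{Mn}/, "").downcase
end

puts fold_accents("Chrétien")
puts fold_accents("El-Aaiún")
```

This only handles characters that decompose into a base letter plus marks; letters such as "ß" or "ø" pass through unchanged, which is where a lookup table like Unidecode's goes further.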

For example, both El-Aaiún and El-Aaiun could reference the same underlying text:

http://svr225.stepx.com:3388/El-Aaiún
http://svr225.stepx.com:3388/El-Aaiun

Cheers,

This looks really promising, but after reading up on this for a while, I
don't see how to get it to work with Rails... could you give me a few
pointers or direct me to some documentation?

Thank you!!

At its core, Unidecode is simply a lookup table. Should be rather straightforward to port to Ruby if it hasn't been done already.
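
To give a rough idea of what "a lookup table" means here, below is a toy Ruby sketch. The table `TRANSLIT_TABLE`, its four entries, the `transliterate` helper, and the "?" placeholder for unmapped characters are all assumptions made for illustration; the real Unidecode tables are far larger and are not reproduced here.

```ruby
# A tiny, hypothetical excerpt of a transliteration table (illustration only).
TRANSLIT_TABLE = {
  "\u00E9" => "e",   # é
  "\u00FA" => "u",   # ú
  "\u00DF" => "ss",  # ß
  "\u00C6" => "AE"   # Æ
}.freeze

# Keep ASCII characters as they are; replace every other character with its
# table entry, or with "?" when the table has no entry for it.
def transliterate(text)
  text.each_char.map do |ch|
    ch.ascii_only? ? ch : TRANSLIT_TABLE.fetch(ch, "?")
  end.join
end

puts transliterate("Chr\u00E9tien")
```

Note that one character can map to several ASCII characters ("ß" to "ss"), so the output may be longer than the input.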

Here is the original Perl implementation:

http://search.cpan.org/~sburke/Text-Unidecode-0.04/lib/Text/Unidecode.pm

And below is a Lua port of it:

http://dev.alt.textdrive.com/browser/HTTP/Unidecode.lua

As well as the lookup tables themselves:

http://dev.alt.textdrive.com/browser/HTTP/Unidecode

Usage example:

local Unidecode = require( 'Unidecode' )

print( 1, 'Москва́', Unidecode( 'Москва́' ) )
print( 2, '北京', Unidecode( '北京' ) )
print( 3, 'Ἀθηνᾶ', Unidecode( 'Ἀθηνᾶ' ) )
print( 4, '서울', Unidecode( '서울' ) )
print( 5, '東京', Unidecode( '東京' ) )
print( 6, '京都市', Unidecode( '京都市' ) )
print( 7, 'नेपाल', Unidecode( 'नेपाल' ) )

> 1 Москва́ Moskva
> 2 北京 beijing
> 3 Ἀθηνᾶ Athena
> 4 서울 seoul
> 5 東京 dongjing
> 6 京都市 jingdushi
> 7 नेपाल nepaal

If Unidecode is too much of a good thing, one could use iconv translit or such, e.g. iconv( 'utf-8', 'us-ascii//TRANSLIT' )...

One way or another, the crux of it is to transliterate your data as well as your query, and then use the latter to search the former.
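
As a rough sketch of that last point in Ruby: each record keeps its original name for display plus a precomputed, transliterated key, and the query goes through the same transformation before matching. Everything here is hypothetical (the `search_key` and `find_authors` helpers, the in-memory `AUTHORS` list), and the accent folding uses Unicode decomposition rather than Unidecode. In a Rails application the key would presumably live in an extra column that is filled in whenever the record is saved.

```ruby
# Hypothetical normalization helper: decompose, drop combining marks, lowercase.
def search_key(text)
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, "").downcase
end

# Hypothetical in-memory stand-in for a table: the original name is kept for
# display, and the precomputed key is used only for matching.
AUTHORS = ["Chr\u00E9tien de Troyes", "Marie de France"].map do |name|
  { name: name, key: search_key(name) }
end

# Transliterate the query the same way as the data, then match on the keys.
def find_authors(query)
  key = search_key(query)
  AUTHORS.select { |row| row[:key].include?(key) }.map { |row| row[:name] }
end

p find_authors("Chretien")
p find_authors("chr\u00E9tien")
```

Both calls return the record for Chrétien de Troyes, with the accent intact in the displayed name.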

Cheers,

Petite Abeille [2008-10-02 19:56]:

At its core, Unidecode is simply a lookup table. Should be rather
straightforward to port to Ruby if it hasn't been done already.

i wanted to do it, but it's been there for over a year now:

<http://rubyforge.org/projects/unidecode>

cheers
jens