Transliterate vs Unidecode

Hi,

I'm not entirely satisfied with the way
ActiveSupport::Inflector.transliterate works:
- "œuf" (egg in French) is transliterated into "uf" instead of the
more logical "oeuf"
- "Straße" is transliterated into "Strae" instead of "Strasse"
- "€" is transliterated into nothing (blank string)

The result is that you end up with meaningless URLs when you generate
them with parameterize, which uses transliterate.
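
To make the three cases above concrete, here is a minimal sketch that
reproduces the reported results. It is not the actual ActiveSupport
code: strip_non_ascii is a hypothetical helper, and it assumes the
behaviour boils down to "decompose accented letters, then drop whatever
is still not ASCII".

```ruby
# Hypothetical helper, NOT the real ActiveSupport implementation.
# Step 1: NFKD-normalize so that "é" becomes "e" plus a combining accent.
# Step 2: drop every character that is still outside the ASCII range.
def strip_non_ascii(text)
  text.unicode_normalize(:nfkd).gsub(/[^\x00-\x7F]/, "")
end

puts strip_non_ascii("café")    # the accent is dropped, the base letter survives
puts strip_non_ascii("œuf")     # "œ" has no decomposition, so it vanishes
puts strip_non_ascii("Straße")  # same for "ß"
puts strip_non_ascii("€").empty?
```

Characters without a Unicode decomposition ("œ", "ß", "€") simply
disappear, which matches the "uf" / "Strae" / blank results listed
above.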

The "unidecode" gem (http://rubyforge.org/projects/unidecode/) has a
different approach:
- any ligature is expanded into separate characters, "œ" is
transliterated into "oe", "ß" into "ss", etc.
- more generally, unidecode always tries to find a replacement. For
example, "€" is transliterated into "EU".
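
As a rough illustration of that table-driven idea (this is not the
gem's code; REPLACEMENTS and expand_with_table are made-up names, and
the real gem ships far larger tables), the approach could look like:

```ruby
# Hypothetical three-entry table, for illustration only.
REPLACEMENTS = { "œ" => "oe", "ß" => "ss", "€" => "EU" }.freeze

# Replace each non-ASCII character with its table entry.
# Characters missing from this tiny table are dropped.
def expand_with_table(text)
  text.gsub(/[^\x00-\x7F]/) { |char| REPLACEMENTS.fetch(char, "") }
end

puts expand_with_table("œuf")
puts expand_with_table("Straße")
puts expand_with_table("€")
```

The point is that every character gets a chance at a readable
replacement before anything is thrown away.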

What do you think: do you prefer the transliterate approach that
ignores any fancy character or the unidecode gem that always tries to
have a meaningful replacement?
Would it make sense to propose a patch that includes and uses the
unidecode gem for the transliterate method?

Martin

Agreed, it has absolutely no value unless you have an English application and want to get rid of 'those pesky unreadable characters'. Because sensible transliteration depends on both the source and the destination locale, it's really hard to solve. I think it shouldn't be included in Rails at all and should be solved in a separate library or gem.

Manfred

I agree with you.

While I may not agree with € => "EU", I think it would be nice if
transliterate relied on Unidecode if the gem is present (very similar
to the approach with the textilize helper).

Basically it means "Rails is running under an environment that cares
about these characters – rely on the gem".
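
A minimal sketch of that "use the gem if it is there" pattern, outside
of Rails. Assumptions: the gem is loadable as "unidecode" and adds
String#to_ascii; transliterate_for_url is a hypothetical name; and the
fallback is only a stand-in for whatever transliterate does today.

```ruby
# Try to load the optional gem; remember whether it worked.
begin
  require "unidecode" # assumption: adds String#to_ascii when installed
  UNIDECODE_AVAILABLE = true
rescue LoadError
  UNIDECODE_AVAILABLE = false
end

# Hypothetical helper: prefer the gem, otherwise strip non-ASCII characters.
def transliterate_for_url(text)
  if UNIDECODE_AVAILABLE && text.respond_to?(:to_ascii)
    text.to_ascii
  else
    text.unicode_normalize(:nfkd).gsub(/[^\x00-\x7F]/, "")
  end
end

puts transliterate_for_url("plain ascii")
```

Applications that never install the gem keep the current behaviour,
and nothing new becomes a hard dependency.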

Agreed, it has absolutely no value unless you have an English
application and want to get rid of 'those pesky unreadable
characters'. Because sensible transliteration is dependent on both the
source and destination locale it's really hard to solve. I think it
shouldn't be included in Rails at all and should be solved in a
separate library or gem.

When you step outside the Latin-derived languages, the transliterate
code is even more problematic. 馬鹿 should return baka but returns '',
and the same is true of any Korean, Thai or Cyrillic text.

This kind of thing is completely outside the scope of parameterize and
I can't imagine we'll ever get a decent solution baked into Rails.
For applications which care about this kind of thing, parameterize
won't ever be a decent solution. They can and should just put the
values into the URL like Wikipedia does. If there are bugs with
routes matching non-ASCII values, we'll fix them.
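
For what it's worth, here is a small sketch of putting the raw value
into the URL using only the Ruby standard library. url_segment is a
hypothetical helper name; how Rails routes such values is not shown
here.

```ruby
require "erb"
require "uri"

# Hypothetical helper: percent-encode the UTF-8 bytes of a value
# so it can be used as a URL path segment.
def url_segment(text)
  ERB::Util.url_encode(text)
end

encoded = url_segment("馬鹿")
puts encoded                                    # %E9%A6%AC%E9%B9%BF
puts URI.decode_www_form_component(encoded)     # back to the original text
```

Browsers generally display the decoded form, so the reader sees the
original characters rather than a lossy ASCII approximation.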