RE: [Rails] Removing Non Alpha & Numeric Characters From String

This is arguably a tad bit prettier:

  title.downcase.gsub(/[^a-z ]/, '').gsub(/ /, '-')

Not sure if it's that much better...

..but doesn't do the same thing as the OP's -- this one will strip out non-vanilla-ASCII accented characters, which could well be part of a title.

Not all books are written in America :-)
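
To make the difference concrete, here is a small sketch comparing the one-liner with a variant that keeps accented letters. The method names are made up for illustration, and the `\p{L}` / `\p{N}` character properties assume the title is a UTF-8 string:

```ruby
# Hypothetical helper names; neither appears in the thread.

# The one-liner from above: anything outside a-z and space is dropped.
def ascii_only_slug(title)
  title.downcase.gsub(/[^a-z ]/, '').gsub(/ /, '-')
end

# Variant that keeps any Unicode letter (\p{L}) or digit (\p{N}).
def unicode_slug(title)
  title.downcase.gsub(/[^\p{L}\p{N} ]/, '').gsub(/ /, '-')
end

puts ascii_only_slug('Hernán Cortés')  # prints "hernn-corts"
puts unicode_slug('Hernán Cortés')     # prints "hernán-cortés"
```

Whether the second form is what you want depends on how the result is used afterwards, since the accented letters still have to survive in a URL.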

Roy Pardee wrote:

This is arguably a tad bit prettier:

  title.downcase.gsub(/[^a-z ]/, '').gsub(/ /, '-')

Not sure if it's that much better...

That works great (and looks prettier :-)! Thanks.

Well, Hassan makes a good point that this will eat any non-ASCII characters. Consider whether you want to do that. If you don't, you'll likely have to URL-encode the result (I don't think, e.g., accented characters are usable in URLs, are they?).

It depends on the Web server being able to handle it, but yes, you can have non-ISO-8859-1 characters in a URL.

Hmmm... are you sure? I thought one would need to encode anything but a small subset of US-ASCII:

"The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values."

"When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded. For example, the character A would be represented as "A", the character LATIN CAPITAL LETTER A WITH GRAVE would be represented as "%C3%80", and the character KATAKANA LETTER A would be represented as "%E3%82%A2"."

http://tools.ietf.org/html/rfc3986

In any case, one approach to URL normalization would be to transliterate the path to ASCII, then convert any non-alphanumeric characters into dashes or something, e.g.:

€2 commemorative coins -> http://svr225.stepx.com:3388/eur2-commemorative-coins
Hernán Cortés -> http://svr225.stepx.com:3388/hernan-cortes
Scanian (linguistics) -> http://svr225.stepx.com:3388/scanian-linguistics
Scheme (programming language) -> http://svr225.stepx.com:3388/scheme-programming-language

Cheers,

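A rough sketch of the "transliterate, then dash" idea in plain Ruby might look like the following. The method name `ascii_slug` is hypothetical, and the transliteration is only an approximation: it decomposes accented letters with `String#unicode_normalize` (Ruby 2.2 or later) and drops the combining marks. It has no symbol table, so it will not turn "€" into "eur" the way the examples above do.

```ruby
# Hypothetical helper; a rough ASCII transliteration using only the standard library.
def ascii_slug(title)
  title
    .unicode_normalize(:nfkd)   # split "é" into "e" plus a combining accent
    .gsub(/\p{Mn}/, '')         # drop the combining marks
    .downcase
    .gsub(/[^a-z0-9]+/, '-')    # any run of other characters becomes one dash
    .gsub(/\A-+|-+\z/, '')      # trim dashes left at either end
end

puts ascii_slug('Hernán Cortés')                  # prints "hernan-cortes"
puts ascii_slug('Scanian (linguistics)')          # prints "scanian-linguistics"
puts ascii_slug('Scheme (programming language)')  # prints "scheme-programming-language"
puts ascii_slug('€2 commemorative coins')         # prints "2-commemorative-coins"
```

Characters that have no decomposition (the euro sign, "ø", "ß" and so on) simply become dashes here, so a real application would probably want an explicit mapping for those.
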
As a simple test, I create a file called "Chrétien.txt" which I drop into a Tomcat web server to view as "http://localhost/sample/Chrétien.txt".

Firefox 2 turns this into: http://localhost/sample/Chrétien.txt while Safari requests http://localhost/sample/Chrétien.txt

But the main thing is that, regardless, the non-US-ASCII name is used to match the resource in the file system.

Firefox 2 turns this into: http://localhost/sample/Chrétien.txt while Safari requests http://localhost/sample/Chrétien.txt

Even though Safari does indeed display the accented characters in its UI, it does encode the URL properly when sending the HTTP request to the server... take a look at your log...

But the main thing is that, regardless, the non-US-ASCII name is used to match the resource in the file system.

Well, yes... once it has been decoded from the HTTP request back to its original form...

Cheers,
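
For anyone who wants to see the round trip, here is a minimal sketch of the encode/decode step, assuming UTF-8 throughout as RFC 3986 suggests. Both method names are hypothetical and written out by hand for illustration; this is not what any particular browser or server runs internally.

```ruby
# Hypothetical helpers for illustration; not taken from any browser or server.
UNRESERVED = /\A[A-Za-z0-9\-._~]\z/

# Percent-encode every UTF-8 byte that is not an RFC 3986 unreserved character.
def percent_encode(text)
  text.encode('UTF-8').bytes.map { |byte|
    char = byte.chr
    byte < 128 && char.match?(UNRESERVED) ? char : format('%%%02X', byte)
  }.join
end

# Reverse the process: turn each %XX back into a byte, then read the bytes as UTF-8.
def percent_decode(encoded)
  encoded.b.gsub(/%(\h\h)/) { $1.hex.chr }.force_encoding('UTF-8')
end

wire = percent_encode('Chrétien.txt')
puts wire                  # prints "Chr%C3%A9tien.txt"
puts percent_decode(wire)  # prints "Chrétien.txt"
```

The encoded form is what would show up in the access log, while the decoded form is what gets matched against the file system.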

What about multiple dashes in the middle of the title?

For example, given:

Primetime Emmy Award for Outstanding Lead Actress - Miniseries or a Movie

One would expect:

primetime-emmy-award-for-outstanding-lead-actress-miniseries-or-a-movie

Note the transition between 'Actress' and 'Miniseries'.

Cheers,
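
One possible way to handle that case, sketched against the one-liner from earlier in the thread: once the " - " has lost its dash, two spaces are left behind, so replacing each run of spaces (rather than each single space) with one dash avoids the double dash. The method names are hypothetical.

```ruby
# Hypothetical helper names; the first is the one-liner from earlier in the thread.
def original_slug(title)
  title.downcase.gsub(/[^a-z ]/, '').gsub(/ /, '-')
end

# Same idea, but a run of spaces collapses into a single dash,
# and leading or trailing spaces are stripped first.
def dashed_slug(title)
  title.downcase.gsub(/[^a-z ]/, '').strip.gsub(/ +/, '-')
end

title = 'Primetime Emmy Award for Outstanding Lead Actress - Miniseries or a Movie'
puts original_slug(title)  # contains "actress--miniseries"
puts dashed_slug(title)    # prints "primetime-emmy-award-for-outstanding-lead-actress-miniseries-or-a-movie"
```

`String#squeeze('-')` applied to the original result would be another way to get the same effect.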