YAML, UTF-8, TextMate, Notepad

This is not a question but a report on the difficulties I had and the
solution I found with respect to UTF-8, YAML::load, and Ruby/Rails.

Comments are appreciated.

- - -

I had been struggling for two days to get UTF-8 working in my Rails app.

I had/have a localization file, lib\locale\de.yml, that had iso-8859-1
encoding. I could not get that to display properly.

Marnen, quite correctly, suggested that I transition to UTF-8. Of course,
I had tried to do that, but I could not get the YAML localization file to
load.

What I had done was load the ANSI (i.e. iso-8859-1) localization file
into Notepad, convert it to UTF-8, and save that file.

Then all my German (de.yml) localizations failed.

It turns out that Notepad places "\xEF\xBB\xBF" at the beginning of the
file to indicate that this is a YAML file.

These three bytes appear to screw up YAML::load.

Gimme a break!

Not only does Notepad put in these indicator bytes ... so does
TextMate.

In fact, TextMate will happily determine that your non-"\xEF\xBB\xBF"
file is a UTF-8 file and will automatically reinsert the indicator
bytes. I find this rather hysterical (not in a good way) since in
http://blog.macromates.com/2005/handling-encodings-utf-8/ one of the
authors of TextMate wrote "Property 3 turns out to be attractive because
it means we can heuristically recognize UTF-8 with a near 100% certainty
by checking if the file is valid. Some software think it’s a good idea
to embed a BOM (byte order mark) in the beginning of an UTF-8 file, but
it is not, because the file can already be recognized, and placing a BOM
in the beginning of a file means placing three bytes in the beginning of
the file which a program that use the file may not expect...".

How thoughtful that TextMate does what the article says it should not
do. If there is a way to turn off that behavior, I can't find it.
Maybe there's a TextMate bundle ... who knows?

In order to get YAML::load to load the localization, I have to remove
the three indicator bytes. Yuck!

Once I did that, YAML loads happily.
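
For anyone who would rather not edit the file by hand, here is a minimal sketch of stripping the three bytes in memory before parsing. The helper name strip_utf8_bom and the sample YAML are made up for illustration, and it assumes a Ruby new enough to have String#b and String#byteslice. Newer YAML parsers may cope with the BOM on their own, so treat this as a workaround for a parser that does not.

```ruby
require "yaml"

# The three bytes of a UTF-8 byte order mark, as a binary string.
UTF8_BOM = "\xEF\xBB\xBF".b

# Hypothetical helper: returns the text without a leading UTF-8 BOM.
def strip_utf8_bom(text)
  bytes = text.b                                  # binary copy, so raw bytes are compared
  return text unless bytes.start_with?(UTF8_BOM)
  bytes.byteslice(3..-1).force_encoding(Encoding::UTF_8)
end

# Sample data standing in for a locale file saved by an editor that writes a BOM.
raw = "\xEF\xBB\xBFde:\n  hello: Hallo\n".b

data = YAML.load(strip_utf8_bom(raw))
puts data.inspect
```

The file on disk is left untouched, which avoids a fight with an editor that puts the bytes back on every save.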

- - - - - - - - -

If you store your locales in lib/locale and you use the
AVAILABLE_LOCALES idiom as suggested in
http://rails-i18n.org/wiki/pages/i18n-available_locales then you can use
this in config\initializers\available_locales.rb

- - -

# See http://guides.rubyonrails.org/i18n.html

# Get loaded locales conveniently
# See http://rails-i18n.org/wiki/pages/i18n-available_locales
module I18n
  class << self
    def available_locales; backend.available_locales; end
  end

  module Backend
    class Simple
      def available_locales; translations.keys.collect { |l| l.to_s }.sort; end
    end
  end
end

# You need to "force-initialize" loaded locales
I18n.backend.send(:init_translations)

AVAILABLE_LOCALES = I18n.backend.available_locales
RAILS_DEFAULT_LOGGER.debug "* Loaded locales: #{AVAILABLE_LOCALES.inspect}"

# Shnelvar: Remove UTF-8 indicator bytes so that YAML::load works
AVAILABLE_LOCALES.each do |localization_name|
  # localization_name is, e.g. "de"
  localization_name_dot_yml = localization_name + '.yml'
  localization_file_name = File.join('lib/locale', localization_name_dot_yml)
  yaml_str = IO.read(localization_file_name)

  utf_8__3_byte_indicator = "\xEF\xBB\xBF"
  if yaml_str[0..2] == utf_8__3_byte_indicator
    # Drop the first three bytes and rewrite the file in place
    yaml_str = yaml_str[3...yaml_str.size]
    File.open(localization_file_name, "w") { |f| f << yaml_str }
    puts localization_file_name + ' has had the UTF-8 indicator bytes removed'
  end
end

- - -
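
An alternative that does not rewrite the locale file: Ruby 1.9.2 and later accept a "BOM|UTF-8" encoding in the file open mode, which skips a leading BOM if one is present. The sketch below uses a temporary file with made-up contents in place of lib/locale/de.yml. It is not what the initializer above does, only another way to get BOM-free text to hand to YAML.

```ruby
require "yaml"
require "tempfile"

# Write a small demo locale file that starts with a UTF-8 BOM.
# "\uFEFF" is the BOM character; in UTF-8 it is the bytes EF BB BF.
file = Tempfile.new(["de", ".yml"])
file.binmode
file.write("\uFEFFde:\n  greeting: Gr\u00FC\u00DFe\n")
file.close

# "r:BOM|UTF-8" asks Ruby to skip a leading BOM, if any, and tag the text as UTF-8.
text = File.open(file.path, "r:BOM|UTF-8") { |f| f.read }

data = YAML.load(text)
puts data["de"]["greeting"]

file.unlink
```

Reading this way works whether or not the editor added the bytes, so the same code handles both kinds of file.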

Suggestions and comments are welcome.

What I had done was load the ANSI (i.e. iso-8859-1) localization file
into Notepad, convert it to UTF-8, and save that file.

<…>

It turns out that Notepad places "\xEF\xBB\xBF" at the beginning of the
file to indicate that this is a YAML file.

This is not to indicate a YAML file (I doubt Notepad knows what YAML is at all).
This is a byte order mark (BOM): http://en.wikipedia.org/wiki/Byte-order_mark

Gimme a break!

Not only does Notepad put in these indicator bytes ... so does
TextMate.

<…>

How thoughtful that TextMate does what the article says it should not
do. If there is a way to turn off that behavior, I can't find it.
Maybe there's a TextMate bundle ... who knows?

Really? I have never seen TextMate do that. Are you sure you did not
just load a file that was saved elsewhere with a BOM?

Regards,
Rimantas

How thoughtful that TextMate does what the article says it should not
do. If there is a way to turn off that behavior, I can't find it.
Maybe there's a TextMate bundle ... who knows?

Really? I have never seen TextMate do that. Are you sure you did not
just load a file that was saved elsewhere with a BOM?

Yes ... absolutely certain.

I use a hex editor to remove the BOM ... resave.

I examine the file with another hex editor ... the BOM is not there.

I go into TextMate ... load the file ... resave ... and the BOM
reappears.

This only happens if TextMate detects UTF-8 characters in the file.
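
For anyone without a hex editor handy, the same check can be sketched in Ruby by reading the first three bytes of the file. The helper name utf8_bom? and the temporary demo files are made up for illustration; point the helper at the real locale file instead.

```ruby
require "tempfile"

# Hypothetical helper: true when the file's first three bytes are EF BB BF.
def utf8_bom?(path)
  head = File.open(path, "rb") { |f| f.read(3) }   # nil for an empty file
  !head.nil? && head.bytes == [0xEF, 0xBB, 0xBF]
end

# Demo files standing in for a locale file saved with and without a BOM.
with_bom = Tempfile.new("with_bom")
with_bom.binmode
with_bom.write("\uFEFFde: {}\n")
with_bom.close

without_bom = Tempfile.new("without_bom")
without_bom.binmode
without_bom.write("de: {}\n")
without_bom.close

puts utf8_bom?(with_bom.path)      # prints true
puts utf8_bom?(without_bom.path)   # prints false
```

Running it before and after a save in the editor shows whether that editor is the one adding the bytes.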

Ralph Shnelvar wrote:

How thoughtful that TextMate does what the article says it should not
do. If there is a way to turn off that behavior, I can't find it.
Maybe there's a TextMate bundle ... who knows?

Really? I have never seen TextMate do that. Are you sure you did not
just load a file that was saved elsewhere with a BOM?

Yes ... absolutely certain.

I use a hex editor to remove the BOM ... resave.

I examine the file with another hex editor ... the BOM is not there.

I go into TextMate ... load the file ... resave ... and the BOM
reappears.

This only happens if TextMate detects UTF-8 characters in the file.

Is there a setting to save as "UTF-8 without BOM" or something?

Best,

Marnen Laibow-Koser wrote:

Is there a setting to save as "UTF-8 without BOM" or something?

As I said earlier, if there is a setting, I can't find it.

TextMate has things called "bundles." These are mini-applications that can
be integrated into TextMate. Someone, somewhere may have figured out
how to do it.

What UTF-8-compliant editor do you use, Marnen?

Marnen Laibow-Koser wrote:

Is there a setting to save as "UTF-8 without BOM" or something?

As I said earlier, if there is a setting, I can't find it.

TextMate does not save a BOM for UTF-8 files.

Just choose Save As, pick UTF-8, and that's it.

Regards,
Rimantas

Rimantas Liubertas wrote:

TextMate does not save a BOM for UTF-8 files.

Just choose Save As, pick UTF-8, and that's it.

Oh, Geez, I feel like a complete idiot ...

I am using "e" as my text editor ... which the advertising says is
"textmate for windows."

Sorry!

It is "e" that is saving the BOM.

Ralph Shnelvar wrote:

Marnen Laibow-Koser wrote:

Is there a setting to save as "UTF-8 without BOM" or something?

As I said earlier, if there is a setting, I can't find it.

TextMate has things called "bundles." These are mini-applications that can
be integrated into TextMate. Someone, somewhere may have figured out
how to do it.

What UTF-8-compliant editor do you use, Marnen?

I mostly use KomodoEdit, for whatever it's worth; also sometimes jEdit,
NetBeans, TextWrangler, Eclipse/Aptana...

Best,