YAML, UTF-8, TextMate, Notepad

This is not a question but a report on the difficulties I had and the solution I found with respect to UTF-8, YAML::load, and Ruby/Rails.

Comments are appreciated.

- - -

I had been struggling for two days to get UTF-8 working in my Rails app.

I had/have a localization file, lib\locale\de.yml, that had iso-8859-1 encoding. I could not get that to display properly.

Marnen, quite correctly, suggested that I switch to UTF-8. Of course, I had tried to do that, but I could not get the YAML localization file to load.

What I had done was load the ANSI (i.e. iso-8859-1) localization file into Notepad, convert it to UTF-8, and save that file.

Then all my German (de.yml) localizations failed.
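As an aside for anyone repeating this conversion: it can also be done in Ruby itself, which avoids the editor adding anything to the file. This is only a minimal sketch, assuming Ruby 1.9.3 or later; the helper name `latin1_to_utf8` and the file paths are hypothetical and not part of the original setup.

```ruby
# Hypothetical helper: re-encode an ISO-8859-1 file as UTF-8 without a BOM.
def latin1_to_utf8(source_path, target_path)
  # Read the raw bytes and label them as ISO-8859-1 (no conversion yet).
  latin1_text = File.binread(source_path).force_encoding("ISO-8859-1")

  # Transcode to UTF-8. String#encode never adds a byte order mark.
  utf8_text = latin1_text.encode("UTF-8")

  # Write the bytes exactly as they are, with no newline translation.
  File.binwrite(target_path, utf8_text)
end

# Example call (hypothetical paths):
# latin1_to_utf8("lib/locale/de.yml", "lib/locale/de.utf8.yml")
```

Writing to a second file first keeps the original around until the result has been checked.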

It turns out that Notepad places "\xEF\xBB\xBF" at the beginning of the file to indicate that this is a YAML file.

These three bytes appear to screw up YAML::load.

Gimme a break!

Not only does Notepad put in these indicator bytes ... so does TextMate.

In fact, TextMate will happily determine that your non-"\xEF\xBB\xBF" file is a UTF-8 file and will automatically reinsert the indicator bytes. I find this rather hysterical (not in a good way), since in the article "Handling encodings (UTF-8)" one of the authors of TextMate wrote: "Property 3 turns out to be attractive because it means we can heuristically recognize UTF-8 with a near 100% certainty by checking if the file is valid. Some software think it’s a good idea to embed a BOM (byte order mark) in the beginning of an UTF-8 file, but it is not, because the file can already be recognized, and placing a BOM in the beginning of a file means placing three bytes in the beginning of the file which a program that use the file may not expect...".

How thoughtful that TextMate does what the article says it should not do. If there is a way to turn off that behavior, I can't find it. Maybe there's a TextMate bundle ... who knows?

In order to get YAML::load to load the localization, I have to remove the three indicator bytes. Yuck!

Once I did that, YAML loads happily.
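If rewriting the file is not desirable, the three bytes can instead be skipped at read time. The following is a rough sketch, assuming Ruby 1.9.2 or later, where the `BOM|UTF-8` external encoding is available; the helper name `load_yaml_without_bom` is made up for illustration and is not something Rails or the I18n backend provides.

```ruby
require "yaml"

# Hypothetical helper: parse a YAML file as UTF-8, ignoring a leading BOM.
def load_yaml_without_bom(path)
  # "BOM|UTF-8" tells Ruby to drop a leading EF BB BF if one is present
  # and to treat the rest of the file as UTF-8.
  text = File.read(path, mode: "r:BOM|UTF-8")
  YAML.load(text)
end

# Example call (hypothetical path):
# translations = load_yaml_without_bom("lib/locale/de.yml")
```

This leaves the file on disk untouched, so it does not matter whether an editor puts the BOM back later.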

- - - - - - - - -

If you store your locales in lib/locale and you use the AVAILABLE_LOCALES idiom as suggested in http://rails-i18n.org/wiki/pages/i18n-available_locales then you can use this in config\initializers\available_locales.rb

- - -

# See "Rails Internationalization (I18n) API" in the Ruby on Rails Guides

# Get loaded locales conveniently
# See http://rails-i18n.org/wiki/pages/i18n-available_locales
module I18n
  class << self
    def available_locales; backend.available_locales; end
  end

  module Backend
    class Simple
      def available_locales; translations.keys.collect { |l| l.to_s }.sort; end
    end
  end
end

# You need to "force-initialize" loaded locales
I18n.backend.send(:init_translations)

AVAILABLE_LOCALES = I18n.backend.available_locales
RAILS_DEFAULT_LOGGER.debug "* Loaded locales: #{AVAILABLE_LOCALES.inspect}"

# Shnelvar: Remove UTF-8 indicator bytes so that YAML::load works
AVAILABLE_LOCALES.each do |localization_name|
  # localization_name is, e.g. "de"
  localization_name_dot_yml = localization_name + '.yml'
  localization_file_name = File.join('lib/locale', localization_name_dot_yml)
  yaml_str = IO.read(localization_file_name)

  utf_8__3_byte_indicator = "\xEF\xBB\xBF"
  if yaml_str[0..2] == utf_8__3_byte_indicator
    yaml_str = yaml_str[3...yaml_str.size]
    File.open(localization_file_name, "w") { |f| f << yaml_str }
    puts localization_file_name + ' has had the UTF-8 indicator bytes removed'
  end
end

- - -

Suggestions and comments are welcome.

What I had done was load the ANSI (i.e. iso-8859-1) localization file into Notepad, convert it to UTF-8, and save that file.

<…>

It turns out that Notepad places "\xEF\xBB\xBF" at the beginning of the file to indicate that this is a YAML file.

This is not to indicate a YAML file (I doubt Notepad knows what YAML is at all). It is a byte order mark (BOM); see "Byte order mark" on Wikipedia.

Gimme a break!

Not only does Notepad put in these indicator bytes ... so does TextMate.

<…>

How thoughtful that TextMate does what the article says it should not do. If there is a way to turn off that behavior, I can't find it. Maybe there's a TextMate bundle ... who knows?

Really? I have never seen TextMate do that. Are you sure you did not just load a file that was saved elsewhere with a BOM?

Regards, Rimantas

How thoughtful that TextMate does what the article says it should not do. If there is a way to turn off that behavior, I can't find it. Maybe there's a TextMate bundle ... who knows?

Really? I have never seen TextMate do that. Are you sure you did not just load a file that was saved elsewhere with a BOM?

Yes ... absolutely certain.

I use a hex editor to remove the BOM ... resave.

I examine the file with another hex editor ... the BOM is not there.

I go into TextMate ... load the file ... resave ... and the BOM reappears.

This only happens if TextMate detects UTF-8 characters in the file.
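For what it's worth, the same check can be scripted instead of opening a hex editor each time. This is only a sketch: the helper name `utf8_bom?` is hypothetical, and it simply compares the first three bytes of the file against EF BB BF.

```ruby
# Hypothetical helper: true when the file starts with the UTF-8 BOM bytes.
def utf8_bom?(path)
  bom = "\xEF\xBB\xBF".b                 # the three BOM bytes as a binary string
  first_bytes = File.binread(path, 3)    # read at most three raw bytes
  first_bytes == bom
end

# Example call (hypothetical path):
# puts utf8_bom?("lib/locale/de.yml")
```

Running it before and after a save in the editor shows whether that save reintroduced the bytes.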

Ralph Shnelvar wrote:

How thoughtful that TextMate does what the article says it should not do. If there is a way to turn off that behavior, I can't find it. Maybe there's a TextMate bundle ... who knows?

Really? I have never seen TextMate do that. Are you sure you did not just load a file that was saved elsewhere with a BOM?

Yes ... absolutely certain.

I use a hex editor to remove the BOM ... resave.

I examine the file with another hex editor ... the BOM is not there.

I go into TextMate ... load the file ... resave ... and the BOM reappears.

This only happens if TextMate detects UTF-8 characters in the file.

Is there a setting to save as "UTF-8 without BOM" or something?

Best,

Marnen Laibow-Koser wrote:

Is there a setting to save as "UTF-8 without BOM" or something?

As I said earlier, if there is a setting, I can't find it.

TextMate has things called "bundles". These are mini-applications that can be integrated into TextMate. Someone, somewhere may have figured out how to do it.

What UTF-8-compliant editor do you use, Marnen?

Marnen Laibow-Koser wrote:

Is there a setting to save as "UTF-8 without BOM" or something?

As I said earlier, if there is a setting, I can't find it.

TextMate does not save a BOM for UTF-8 files.

Just choose Save As, pick UTF-8, and that's it.

Regards, Rimantas

Rimantas Liubertas wrote:

TextMate does not save a BOM for UTF-8 files.

Just choose Save As, pick UTF-8, and that's it.

Oh, Geez, I feel like a complete idiot ...

I am using "e" as the text editor ... which the advertising says is "TextMate for Windows."

Sorry!

It is "e" that is saving the BOM.

Ralph Shnelvar wrote:

Marnen Laibow-Koser wrote:

Is there a setting to save as "UTF-8 without BOM" or something?

As I said earlier, if there is a setting, I can't find it.

TextMate has things called "bundles". These are mini-applications that can be integrated into TextMate. Someone, somewhere may have figured out how to do it.

What UTF-8-compliant editor do you use, Marnen?

I mostly use KomodoEdit, for whatever it's worth; also sometimes jEdit, NetBeans, TextWrangler, Eclipse/Aptana...

Best,