How can I get malformed UTF-8 characters to display properly?

nonrecursive · August 27, 2007, 2:31pm

Hello everyone,

I'm scraping a lot of sites for a project, and occasionally the scraped content will have "malformed UTF-8" characters. When the scraped content is processed (basically a database record is created), these characters often don't appear as they're supposed to.

Normally, the following code works great:

str.unpack("U*").collect {|s| (s > 127 ? "&##{s};" : s.chr) }.join("")

But it won't work with these "malformed UTF-8" characters. So I've written the following to handle these characters, but it still isn't perfect. For example, I scraped this page: http://web.mac.com/j3mbeck/iWeb/JohnBeckPaper_Steel/Fireplace%20Surrounds.html

The alt attribute of the first thumbnail, steel surround, contains the text "Steel has that effect where you'd least expect it". The ' character shows up as Õ when I use the method below, and the "d" is just swallowed.
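The symptom described above is consistent with the page being served in a legacy single-byte encoding rather than UTF-8. The sketch below is only an illustration under that assumption (the post does not confirm the page's encoding): it takes the byte 0xD5, which is Õ in ISO-8859-1 and the curly apostrophe U+2019 in MacRoman, and shows how the same bytes read under three interpretations. The sample string and variable names are made up for the example, and it uses the String encoding API from Ruby 1.9 and later.

```ruby
# Assumption: the scraped bytes are MacRoman, not UTF-8.
# 0xD5 is the right single quotation mark in MacRoman.
raw = "you\xD5d least".b

# Read as ISO-8859-1, byte 0xD5 is the letter O with tilde (U+00D5).
latin1 = raw.dup.force_encoding("ISO-8859-1").encode("UTF-8")

# Read as MacRoman, the same byte is the curly apostrophe (U+2019).
mac = raw.dup.force_encoding("macRoman").encode("UTF-8")

# Read as UTF-8, 0xD5 is the lead byte of a two-byte sequence, but the
# next byte, "d" (0x64), is not a continuation byte, so the string is invalid.
utf8_valid = raw.dup.force_encoding("UTF-8").valid_encoding?

puts latin1
puts mac
puts utf8_valid
```

A decoder that trusts the lead byte and consumes the following byte as part of the same character may be one reason a neighbouring letter disappears, though that is a guess rather than something the post establishes.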

data.gsub!(/\323/, '"')

require 'oniguruma'

o = Oniguruma::ORegexp.new('[^[:ascii:]]')
# o = Oniguruma::ORegexp.new('[^[:ascii:]]', {:encoding => Oniguruma::ENCODING_UTF8})

# Collect the characters of the scraped data into an array.
chars = []
data.each_char{|c| chars << c}

chars.collect do |c|
  if o.match c
    begin
      # Well-formed non-ASCII character: emit a numeric character reference.
      "&##{c.unpack('U*').first};"
    rescue ArgumentError
      add_log_message("Has malformed UTF-8 characters")
      # handling malformed UTF-8: a huge pain and possibly a future cause of problems
      bytes = []
      c.each_byte{|b| bytes << b}
      # Assumes we're handling, at most, 2-byte strings. We have no way to know if the
      # malformed character is supposed to be one byte or two, but we're assuming it's 1.
      ["&##{bytes[0]};"] + bytes[1..-1].collect{|b| b.chr}
    end
  else
    c
  end
end.flatten.join('')

Any suggestions?

Thanks! Daniel
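One possible direction, sketched here for Ruby 1.9 or later rather than the Ruby of 2007: check whether the bytes are valid UTF-8 and, if they are not, reinterpret them in a fallback legacy encoding before producing the numeric character references. The method name `to_entities` and the `FALLBACK` constant are hypothetical, and the choice of "macRoman" is an assumption based on the bytes mentioned in the post (0xD3 and 0xD5), not something the page is known to declare.

```ruby
# Hypothetical default: the legacy encoding to try when the input is
# not valid UTF-8. Adjust per site; "macRoman" is only a guess here.
FALLBACK = "macRoman"

def to_entities(data, fallback = FALLBACK)
  text = data.dup.force_encoding("UTF-8")
  unless text.valid_encoding?
    # Reinterpret the raw bytes in the fallback encoding and transcode,
    # replacing anything that cannot be mapped with U+FFFD.
    text = data.dup.force_encoding(fallback)
               .encode("UTF-8", invalid: :replace, undef: :replace)
  end
  # Same output format as the one-liner in the post: ASCII stays as is,
  # every other character becomes a decimal numeric character reference.
  text.each_char.map { |c| c.ord > 127 ? "&##{c.ord};" : c }.join
end

puts to_entities("you\xD5d least".b)  # you&#8217;d least
puts to_entities("caf\u00E9")          # caf&#233;
```

The fallback is a per-source decision: a byte sequence that is invalid as UTF-8 carries no indication of which legacy encoding produced it, so a wrong guess yields readable but incorrect characters.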
