Hpricot problems when scraping sites with accented characters

Hi,

I'm using Hpricot to try and grab reviews from various sites. Some of
these reviews are in French, German, etc., and so contain accented
characters.

However, these are coming out the other end as a load of question
marks.

I suspect this is some kind of encoding issue, and the best of my
googling suggests that Ruby and Rails kinda suck at character
encodings.

I've tried blindly adding the following to my environment.rb:

$KCODE = 'u'
require 'jcode'

but it doesn't seem to have helped at all.

Any idea what I can try next? Am I likely to be able to get this
working?

Thanks,

Luke.

- detect the encoding with chardet,
- use Iconv to convert the original content to utf-8,
- only then use Hpricot to parse it.

Lionel
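
For anyone reading this on a newer Ruby, the three steps above can be sketched roughly as follows. This is only a sketch under assumptions: Iconv was removed from the standard library in Ruby 2.0, so String#encode is swapped in for it here, and `to_utf8` and `detector` are made-up names. The detector is just a callable standing in for whatever charset guesser you use (chardet or otherwise).

```ruby
# Sketch of the three steps, for Ruby 2.0 or later.
# String#encode stands in for Iconv; `detector` is a hypothetical callable
# that wraps whatever charset guesser you use (e.g. chardet).
def to_utf8(raw_bytes, detector)
  # 1. Guess the source encoding from the raw bytes.
  source_encoding = detector.call(raw_bytes)
  # 2. Convert to UTF-8. dup first so the caller's string is not relabelled.
  raw_bytes.dup.force_encoding(source_encoding).encode('UTF-8')
end

# A stand-in detector that always answers ISO-8859-1, for illustration only.
latin1_detector = lambda { |_bytes| 'ISO-8859-1' }

latin1_bytes = "caf\xE9".b # "cafe" with an e-acute, as ISO-8859-1 bytes
utf8_text = to_utf8(latin1_bytes, latin1_detector)
# 3. Only now hand utf8_text to the parser, e.g. Hpricot(utf8_text).
```

If the guessed encoding is wrong, encode may raise or quietly produce the wrong characters, so the quality of step 1 matters a lot.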

Phlip wrote:

Would Tidy -utf do the first two steps automatically?

I've never tested it (I had already coded a working chardet + Iconv
implementation).

From tidy's documentation, it seems it would request an encoding, but
not force it, so buggy servers will still crash your code.

In fact, I had to use a begin Iconv.iconv('utf-8', 'utf-8') rescue ....
end to make absolutely sure the results really are UTF-8 (you don't want
badly encoded data trying to enter a database set to use UTF-8...)

Lionel
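
A rough modern equivalent of that UTF-8 to UTF-8 round trip, assuming Ruby 2.1 or later, is to ask the string itself whether its bytes are valid. The helper names below are invented for illustration, and replacing bad bytes with "?" is just one possible policy, not necessarily what the rescue branch above did.

```ruby
# Returns true when the string's bytes are well-formed UTF-8.
def valid_utf8?(text)
  text.dup.force_encoding('UTF-8').valid_encoding?
end

# Hypothetical clean-up policy: keep valid input untouched and replace
# any invalid byte sequence with "?" instead of raising.
def ensure_utf8(text)
  candidate = text.dup.force_encoding('UTF-8')
  candidate.valid_encoding? ? candidate : candidate.scrub('?')
end
```

Running every string through something like ensure_utf8 just before it goes to the database gives the same guarantee: nothing that is not valid UTF-8 gets in.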

Thanks for the responses, but could you elaborate a little please?

At the moment I have:

Hpricot(open(uri))

(with a "require 'open-uri'" at the top)

What do I need to do, and where?

In reply to lukens's message of 07.08.2007 20:20:

open(uri) gives you a String in an unknown encoding. Hpricot expects
UTF-8, so you must make sure that the String you get is converted to
UTF-8. To do so, you must use the Iconv library, but it expects you to
know which encoding the source is in. The chardet library will be able
to guess the original encoding.

For the details, look up the documentation of chardet and Iconv. Iconv
is in the standard library, chardet is a separate download.

Lionel

Thanks for the help.

open(uri) returns a File-like object (a Tempfile or StringIO), rather
than a String. After playing with various options for detecting the
encoding, I found that the file object has a charset method, which
returns the encoding (I think this is only on a file returned by
open-uri).

This was handy, as chardet seemed pretty crap at detecting the encoding
correctly. It was slightly better when I tried doing it a line at a
time, but for the whole file it just sucked. I still have a fallback
to chardet if the file object doesn't respond to 'charset'.

I should note that I was using rchardet, as I couldn't get the chardet
gem to play ball at all.
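
In case it helps anyone else, the "charset first, guesser as fallback" idea looks roughly like this. It is a sketch only, assuming Ruby 2.0 or later (String#encode instead of Iconv). `source_charset`, `read_as_utf8` and `fallback_detector` are names made up for the example, and the detector is a callable you would wire up to rchardet or similar yourself.

```ruby
require 'stringio'

# Pick an encoding name for a fetched document.
# `io` is whatever open-uri returned; `fallback_detector` is a hypothetical
# callable wrapping a guesser such as rchardet.
def source_charset(io, body, fallback_detector)
  if io.respond_to?(:charset) && io.charset
    io.charset # taken from the Content-Type header by open-uri
  else
    fallback_detector.call(body)
  end
end

# Read the whole body and return it as a UTF-8 String.
def read_as_utf8(io, fallback_detector)
  body = io.read
  charset = source_charset(io, body, fallback_detector)
  body.force_encoding(charset).encode('UTF-8')
end
```

With open-uri loaded, the result of open(uri) would be passed in as io, and the String that comes back is what gets handed to Hpricot.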