Hpricot problems when scraping sites with accented characters

Hi,

I'm using Hpricot to try to grab reviews from various sites. Some of these reviews are in French, German, etc., and so contain accented characters.

However, these are coming out the other end as a load of question marks.

I suspect this is some kind of encoding issue, and the best of my googling has revealed that Ruby and Rails kinda suck at character encodings.

I've tried blindly adding the following to my environment.rb:

$KCODE = 'u'
require 'jcode'

but it doesn't seem to have helped at all.

Any idea what I can try next? Am I likely to be able to get this working?

Thanks,

Luke.

lukens wrote:

Any idea what I can try next? Am I likely to be able to get this working?

- detect the encoding with chardet,
- use Iconv to convert the original content to utf-8,
- only then use Hpricot to parse it.
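The middle step of that list (converting to UTF-8 once the source encoding is known) might look roughly like the sketch below. It assumes Ruby 1.9 or later, where String#encode can stand in for Iconv (Iconv is no longer bundled with Ruby 2.0 and later). The helper name `to_utf8` and the sample bytes are made up for illustration.

```ruby
# Sketch only: String#encode stands in for Iconv here.

# Convert raw bytes from a known source encoding to UTF-8.
# `source_encoding` is whatever the detection step reported.
# `to_utf8` is a hypothetical helper name, not part of any library.
def to_utf8(raw, source_encoding)
  # dup so the caller's string is not retagged in place
  raw.dup.force_encoding(source_encoding).encode('UTF-8')
end

# "café" as ISO-8859-1 bytes: the é is the single byte 0xE9.
latin1_bytes = "caf\xE9".b
puts to_utf8(latin1_bytes, 'ISO-8859-1')
```

Only after this step would the resulting string be handed to the parser.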

Lionel

Lionel Bouton wrote:

- detect the encoding with chardet,
- use Iconv to convert the original content to utf-8,
- only then use Hpricot to parse it.

Would Tidy -utf do the first two steps automatically?

Phlip wrote:

Lionel Bouton wrote:

- detect the encoding with chardet,
- use Iconv to convert the original content to utf-8,
- only then use Hpricot to parse it.

Would Tidy -utf do the first two steps automatically?

I've never tested it (I had already coded a working chardet + Iconv implementation).

From tidy's documentation, it seems it would request an encoding, but not force it, so buggy servers will still crash your code.

In fact, I had to use a begin Iconv.iconv('utf-8', 'utf-8') rescue .... end block to make absolutely sure the results really are UTF-8 (you don't want badly encoded text trying to enter a database set to use UTF-8...)
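For what it's worth, the same kind of "is this really UTF-8?" check can be sketched without Iconv on Ruby 1.9 or later, using String#valid_encoding?. This is a swapped-in technique, not the Iconv round trip described above, and `ensure_utf8` is a hypothetical helper name.

```ruby
# Sketch only: String#valid_encoding? replaces the Iconv utf-8 to utf-8
# round trip as the validity check.

# Return a copy of `text` tagged as UTF-8, or raise if its bytes are not
# valid UTF-8. `ensure_utf8` is a hypothetical helper name.
def ensure_utf8(text)
  candidate = text.dup.force_encoding('UTF-8')
  raise ArgumentError, 'input is not valid UTF-8' unless candidate.valid_encoding?
  candidate
end

begin
  # ISO-8859-1 bytes for "résumé": 0xE9 on its own is not valid UTF-8.
  ensure_utf8("r\xE9sum\xE9".b)
rescue ArgumentError => e
  # Skip, log, or re-convert the page here instead of storing it.
  puts "rejected: #{e.message}"
end
```

Raising on bad input keeps mis-encoded text out of the database; what to do in the rescue branch depends on the application.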

Lionel

Thanks for the responses, but could you elaborate a little please?

At the moment I have:

Hpricot(open(uri))

(with a "require 'open-uri'" at the top)

What do I need to do, and where?

lukens wrote the following on 07.08.2007 20:20 :

What do I need to do, and where?

open(uri) gives you a String in an unknown encoding. Hpricot expects UTF-8, so you must make sure that the String you get is converted to UTF-8. To do so, use the Iconv library, but note that it expects you to know which encoding the source is in. The chardet library can guess the original encoding.

For the details, look up the documentation of chardet and Iconv. Iconv is in the standard library; chardet is a separate download.

Lionel

Thanks for the help.

open(uri) returns a file-like object (a Tempfile or StringIO) rather than a String. After playing with various options for detecting the encoding, I found that this object has a charset method, which returns the encoding (I think this is only on an object returned by open-uri).

This was handy, as chardet seemed pretty crap at detecting the encoding correctly. It was slightly better when I tried doing it a line at a time, but for the whole file it just sucked. I still have a fallback to chardet if the object doesn't respond to 'charset'.
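A rough sketch of that "use charset if it is there, otherwise guess" logic is below. It is not my actual code: the helper names are made up, the detector is passed in as a plain callable so that rchardet (or anything else) can be plugged in, and a StringIO with a hand-written charset method stands in for what open-uri returns. String#encode is used for the conversion, which assumes Ruby 1.9 or later.

```ruby
require 'stringio'

# Sketch only. `detector` is a hypothetical callable standing in for
# rchardet or similar: it receives the raw bytes and returns an encoding name.
def charset_for(io, raw, detector)
  # Prefer the charset reported by the response object, when there is one.
  return io.charset if io.respond_to?(:charset) && io.charset
  detector.call(raw)
end

# Read everything from `io` and return it converted to UTF-8.
def read_as_utf8(io, detector)
  raw = io.read
  raw.force_encoding(charset_for(io, raw, detector)).encode('UTF-8')
end

# Stand-in for the object open-uri returns: a StringIO with a charset method.
page = StringIO.new("M\xFCnchen".b)
def page.charset
  'iso-8859-1'
end

fallback = ->(_bytes) { 'UTF-8' } # placeholder guess, not a real detector
puts read_as_utf8(page, fallback)
```

The string that comes back is what would then be passed to Hpricot.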

I should note that I was using rchardet as I couldn't get the chardet gem to play ball at all.