I am parsing XML streams with ruby-libxml using the XML::Reader class. Several have invalid UTF-8 characters. I need a tutorial or at least some hints on how to recover and continue the parsing.
TIA, Jeffrey
Jeffrey L. Taylor wrote:
> I am parsing XML streams with ruby-libxml using the XML::Reader class. Several have invalid UTF-8 characters. I need a tutorial or at least some hints on how to recover and continue the parsing.
Why not scrub them with Ruby's built-in iconv first?
And what are they doing to you and ruby-libxml? I have found libxml2 suspiciously forgiving, so far...
Quoting Phlip <phlip2005@gmail.com>:
> Jeffrey L. Taylor wrote:
>> I am parsing XML streams with ruby-libxml using the XML::Reader class. Several have invalid UTF-8 characters. I need a tutorial or at least some hints on how to recover and continue the parsing.
> Why not scrub them with Ruby's built-in iconv first?
> And what are they doing to you and ruby-libxml? I have found libxml2 suspiciously forgiving, so far...
It throws an exception. It took a lot of digging to find that the byte at line 835, character 418 is truly not a UTF-8 character (octal 240, maybe a Latin-1 character?). I'd like to delete it or replace it with a question mark and continue parsing. It is a rather large file, so I'd rather not read the whole thing into memory to correct it. I suppose I could wrap the read function in a cleanup function, but it would be messy trying to keep UTF-8 state across partial reads.
I was hoping for something better.
Jeffrey
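
For what it's worth, the "wrap the read in a cleanup function" idea can be sketched roughly as below. This is only a sketch under assumptions: the class name Utf8ScrubbingReader, its each_chunk method, and the chunk_size and replacement parameters are all made up for illustration and are not part of ruby-libxml. It also assumes a Ruby new enough to have String#scrub (2.1 or later) rather than iconv. The only state kept between reads is the trailing bytes of a multi-byte sequence that may have been split at the chunk boundary.

```ruby
require "stringio"

# Hypothetical helper (NOT part of ruby-libxml): reads an IO-like object in
# fixed-size chunks and yields strings that are always valid UTF-8.
class Utf8ScrubbingReader
  def initialize(io, chunk_size: 64 * 1024, replacement: "?")
    @io = io
    @chunk_size = chunk_size
    @replacement = replacement
    @carry = "".b # bytes of a possibly incomplete sequence from the last chunk
  end

  def each_chunk
    while (data = @io.read(@chunk_size))
      buf = @carry + data.b
      cut = split_point(buf)
      # Hold back a trailing partial sequence until the next read completes it.
      @carry = buf.byteslice(cut, buf.bytesize - cut)
      yield scrub(buf.byteslice(0, cut)) if cut > 0
    end
    # At end of input, whatever is still held back can only be invalid.
    yield scrub(@carry) unless @carry.empty?
    @carry = "".b
  end

  private

  # Replace every invalid byte sequence with the replacement string.
  def scrub(bytes)
    bytes.force_encoding(Encoding::UTF_8).scrub(@replacement)
  end

  # Byte index at which buf can be cut without splitting a multi-byte
  # sequence. Only the last three bytes need to be inspected.
  def split_point(buf)
    size = buf.bytesize
    1.upto([3, size].min) do |back|
      byte = buf.getbyte(size - back)
      next if byte & 0xC0 == 0x80 # continuation byte, keep looking back
      needed = if byte >= 0xF0 then 4
               elsif byte >= 0xE0 then 3
               elsif byte >= 0xC0 then 2
               else 1
               end
      # Cut before the lead byte only if its sequence is still incomplete.
      return needed > back ? size - back : size
    end
    size
  end
end
```

A stray byte such as octal 240 (0xA0) is a continuation byte with no lead byte in front of it, so it is never held back and simply comes out as the replacement character. Hooking this up to XML::Reader is left open here: whether the reader can consume a wrapper object directly depends on which methods the binding calls on it, which I have not verified.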