libxml-ruby sax parsing open-uri

Hi there,

I need to connect to a URL to download and process an XML document, then run through the XML and save elements in the database.

There are many how-tos on the internet about parsing XML files with SAX by opening a file on the filesystem and reading through it, but I could not find an example of how to read from a URL while processing the XML.

SAX will be useless if the content from the URL has to be downloaded completely before it can be processed; the RAM will still fill up.

Does anybody have a solution for this problem, or maybe a sample snippet showing how to deal with this in Ruby or Rails? I don't care whether it is libxml, REXML, or something else, as long as it uses less RAM.

Thanks for your help. Chris

Not sure if this will help, but with open-uri (part of Ruby's standard library) you can open a file from a URL. I use it in a converter I'm working on right at this moment:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

# here I'm loading the xsd from W3 directly
xsd = Nokogiri::XML::Schema(open('http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd'))

...etc...

I'm not at all sure that this will save you on RAM. I'm loading temp files in another part of this script (from the filesystem) and ripping through them with regular expressions one line at a time, but after all that's done, I open the partially-transformed file with Nokogiri in one large bite and do all sorts of things to it. Some of these files are 10-20 MB of XML text. It's currently working fine inside a hard limit of 2 GB of RAM. I wouldn't be surprised if Nokogiri does some very clever things to manage its memory footprint, because it certainly works much more efficiently than the previous generation of this system, which used XSLT and Saxon, and crapped out on anything over 6 MB of input.
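Since your original question was about SAX, here is a minimal, untested sketch of how I would expect open-uri plus Nokogiri's SAX parser to look together, so the document is handled in a streaming fashion instead of being built into a DOM tree. The element name 'item' and the callback logic are just placeholders for whatever your XML actually contains:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

# SAX handler: Nokogiri calls these methods as it walks the stream,
# so only the current element's text needs to be held in memory.
class ItemHandler < Nokogiri::XML::SAX::Document
  def start_element(name, attrs = [])
    @buffer = '' if name == 'item'   # 'item' is a placeholder element name
  end

  def characters(text)
    @buffer << text if @buffer
  end

  def end_element(name)
    if name == 'item' && @buffer
      # save @buffer to the database here instead of printing it
      puts @buffer
      @buffer = nil
    end
  end
end

# open() comes from open-uri and returns an IO-like object
# that the SAX parser reads from piece by piece.
Nokogiri::XML::SAX::Parser.new(ItemHandler.new).parse(open('http://example.com/big.xml'))

One caveat: open-uri still fetches the whole response before open() returns (into memory for small bodies, or a temp file for larger ones), so the parsing is streaming even though the download itself is not.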

Walter

Quoting Chris Armstrong <lists@ruby-forum.com>:

> Hi there,
>
> I need to connect to a URL to download and process an XML document, then run through the XML and save elements in the database.
>
> There are many how-tos on the internet about parsing XML files with SAX by opening a file on the filesystem and reading through it, but I could not find an example of how to read from a URL while processing the XML.

The Ruby wrapper around the libxml2 C library (libxml-ruby) supports the DOM model (parse the whole file into a data structure that can be searched, edited, and written back out), SAX, and the Reader model. All of them will handle any IO-like class; they don't need the whole input in memory, but can call IO#read repeatedly and parse the data as it is read rather than all at once. Unless you have a hard requirement to use SAX, look at the Reader model. IMHO, it is easier to use. The documentation (http://libxml.rubyforge.org/rdoc/index.html) is very good. Better than libxml2's documentation, IMHO.
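To make that concrete, here is a rough, untested sketch of the Reader model fed by open-uri, assuming I'm remembering the API correctly (XML::Reader.io, #read, #node_type, #name, #read_string). The URL and the element name 'item' are placeholders:

require 'rubygems'
require 'libxml'
require 'open-uri'
include LibXML

# Reader.io takes any IO-like object, so the parser pulls data
# as it needs it instead of holding the whole document in memory.
reader = XML::Reader.io(open('http://example.com/big.xml'))

# Walk the stream node by node; 'item' stands in for your element name.
while reader.read
  if reader.node_type == XML::Reader::TYPE_ELEMENT && reader.name == 'item'
    # read_string returns the text content of the current element;
    # save it to the database here instead of printing it.
    puts reader.read_string
  end
end
reader.close

The nice part compared to SAX is that you drive the loop yourself instead of collecting state across callbacks.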

HTH,   Jeffrey