Quoting PierreW <wamrewam@googlemail.com>:
Hello,
I need to parse two big XML files in a row (30+MB each). I have tried
both REXML and Hpricot. They do work. Thing is, with both libraries,
the parsing of each file takes a huge amount of memory: more than
700MB each!
So I was wondering:
- is it normal that parsing a 30MB file takes 700MB of memory? Could
it be that something is wrong with the file? Is there an alternative
way to deal with such big files?
- is there a way to force the release of the memory when I don't need
the file anymore? At the moment it is not released instantly after the
first file, so I end up with 1.5GB memory use.
Yes, that is normal. XML libraries generally keep the whole content, including
whitespace, in an easily searchable tree structure, often alongside the
original text, plus per-node overhead.
I have reduced the code to the minimum to isolate the memory issue:
require "rexml/document"         # or: require "hpricot"
xml = File.read("myfile.xml")
doc = REXML::Document.new(xml)   # or: doc = Hpricot.XML(xml)
doc = nil                        # drop the only reference to the parsed tree
and repeat with the second file.
Also, I tried libxml in case. I get an error message that I can't
explain:
LibXML::XML::Error (Fatal error: Input is not proper UTF-8, indicate
encoding !
Yet the file is UTF-8 as far as I can tell.
LibXML is very picky about UTF-8 and I have not been able to figure out how to
get it to recover and continue parsing. Since Nokogiri is less picky and is
built on libxml2 (Hpricot uses its own parser), I presume it is possible.
Whenever I have dug into the source file in question, there has been a
non-UTF-8 character. Look at the reader.line_number and reader.column_number
values to find where.
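
If you would rather check the file before handing it to a parser, a plain Ruby
scan can point at the offending bytes. This is only a minimal sketch:
find_invalid_utf8 is a name I made up, it uses nothing from LibXML, and it
assumes the file really is meant to be UTF-8 with "\n" line endings. Columns
are counted in characters, so they may not match what the parser reports.

```ruby
require "stringio"

# Hypothetical helper: report every byte that is not part of a valid UTF-8
# sequence. Returns an array of [line_number, column_number, hex_byte] triples.
def find_invalid_utf8(io)
  problems = []
  io.each_line.with_index(1) do |raw_line, line_number|
    # Reinterpret the bytes as UTF-8 without changing them.
    line = raw_line.dup.force_encoding(Encoding::UTF_8)
    next if line.valid_encoding?

    # each_char yields each broken byte as its own invalid "character".
    line.each_char.with_index(1) do |char, column|
      next if char.valid_encoding?
      hex = char.bytes.map { |b| format("0x%02X", b) }.join(" ")
      problems << [line_number, column, hex]
    end
  end
  problems
end

# In-memory sample with a Latin-1 e-acute (0xE9) in it.
# For a real file, use File.open("myfile.xml", "rb") instead.
sample = StringIO.new("<a>ok</a>\n<b>caf\xE9</b>\n".b)
find_invalid_utf8(sample).each do |line, column, hex|
  puts "line #{line}, column #{column}: #{hex}"
end
```

A stray 0xE9 or similar usually means part of the file was written as Latin-1
or Windows-1252, in which case re-encoding the file (or declaring the real
encoding in the XML prolog) is the usual way out.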
As another person has suggested, the SAX API does not keep the whole file or
DOM in memory, so it uses much less memory. Also look at the XML::Reader
interface; it is very fast and not at all memory hungry. Your code will
probably not be as pretty as with the DOM APIs, but sometimes it is worth the
trade-off.
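
To give an idea of what SAX-style code looks like, here is a rough sketch. It
uses REXML's stream parser rather than LibXML's SAX API, simply because REXML
ships with Ruby (newer Rubies package it as the rexml gem) and you have already
tried it. TextCollector and collect_text are names invented for this example.

```ruby
require "rexml/document"
require "rexml/streamlistener"
require "stringio"

# Hypothetical listener: counts elements and collects the text of one tag.
# No tree is built; the parser just calls these methods as it goes.
class TextCollector
  include REXML::StreamListener

  attr_reader :texts, :element_count

  def initialize(wanted_tag)
    @wanted_tag = wanted_tag
    @texts = []
    @element_count = 0
    @buffer = nil
  end

  def tag_start(name, _attributes)
    @element_count += 1
    @buffer = +"" if name == @wanted_tag
  end

  def text(content)
    @buffer << content if @buffer
  end

  def tag_end(name)
    return unless name == @wanted_tag && @buffer
    @texts << @buffer.strip
    @buffer = nil
  end
end

# Stream the document from an IO and return the filled-in listener.
def collect_text(io, wanted_tag)
  listener = TextCollector.new(wanted_tag)
  REXML::Document.parse_stream(io, listener)
  listener
end

# Usage with an in-memory sample; pass File.open("myfile.xml") for a real file.
xml = "<list><item><name>First</name></item><item><name>Second</name></item></list>"
result = collect_text(StringIO.new(xml), "name")
puts result.texts.inspect
```

Passing an open File instead of the string from File.read also avoids holding
the raw text in memory next to whatever you extract from it.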
Switching from FeedTools to XML::Reader for reading RSS feeds, and grabbing
just what my app needs, resulted in a speed-up of better than 10x, maybe 100x.
This is especially applicable if your code handles multiple XML schemas: there
are at least 6 different RSS schemas with varying interpretations, and for some
fields the whole DOM is searched 10 times. A single pass with custom code is
ugly, but worth it in my application.
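
For illustration only, a single pass over more than one feed format might look
something like this. It is not my application's code: it uses REXML's pull
parser as a stand-in for XML::Reader (both hand you one event at a time), and
entry_titles and ENTRY_TAGS are made-up names. It also assumes that RSS 2.0
wraps entries in <item>, Atom wraps them in <entry>, and both put the headline
in an unprefixed child <title>.

```ruby
require "rexml/document"
require "rexml/parsers/pullparser"
require "stringio"

# Assumption for this sketch: these are the entry wrappers we care about.
ENTRY_TAGS = %w[item entry].freeze

# Hypothetical helper: walk the document once and return the entry titles,
# ignoring the feed-level <title>.
def entry_titles(io)
  parser = REXML::Parsers::PullParser.new(io)
  titles = []
  in_entry = false
  in_title = false
  current = +""

  while parser.has_next?
    event = parser.pull
    if event.start_element?
      in_entry = true if ENTRY_TAGS.include?(event[0])
      if in_entry && event[0] == "title"
        in_title = true
        current = +""
      end
    elsif event.text?
      # event[0] is the text as it appears in the source.
      current << event[0] if in_title
    elsif event.end_element?
      if in_title && event[0] == "title"
        titles << current.strip
        in_title = false
      end
      in_entry = false if ENTRY_TAGS.include?(event[0])
    end
  end
  titles
end

rss = "<rss><channel><title>Feed</title><item><title>A</title></item></channel></rss>"
puts entry_titles(StringIO.new(rss)).inspect
```

The state flags are the ugly part I mentioned, but each document is read
exactly once no matter how many fields you pull out of it.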
HTH,
Jeffrey