Parsing big XML files - memory issue

Hello,

I need to parse two big XML files in a row (30+MB each). I have tried
both REXML and Hpricot. They do work. Thing is, with both libraries,
the parsing of each file takes a huge amount of memory: more than
700MB each!

So I was wondering:
- is it normal that parsing a 30MB file takes 700MB of memory? Could
it be that something is wrong with the file? Is there an alternative
way to deal with such big files?
- is there a way to force the release of the memory when I don't need
the file anymore? At the moment it is not released instantly after the
first file, so I end up with 1.5GB memory use.

I have reduced the code to the minimum to isolate the memory issue:

xml = File.read("myfile.xml")
doc = REXML::Document.new(xml)   # or: doc = Hpricot.XML(xml)
doc = nil

and repeat with the second file.

Also, I tried libxml just in case. I get an error message that I can't
explain:
LibXML::XML::Error (Fatal error: Input is not proper UTF-8, indicate
encoding !)
Yet the file is UTF-8 as far as I can tell.

Thanks a lot for your help.
Pierre

> - is it normal that parsing a 30MB file takes 700MB of memory? Could
> it be that something is wrong with the file? Is there an alternative
> way to deal with such big files?

DOM parsers can use up a lot of memory with large files (10x the file
size or more). SAX parsers don't, because they never keep the whole
document in memory - they just fire events as they stream through it.
REXML does have a SAX-style parser, and libxml has one too.
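The SAX approach looks roughly like this with REXML's bundled SAX2 parser (a minimal sketch, not Pierre's actual data - the `<item>` elements and the inline XML are made up for illustration):

```ruby
require 'rexml/parsers/sax2parser'

# Count <item> elements without ever building a DOM: the parser just
# fires events as it streams through the document.
xml = "<items><item>a</item><item>b</item></items>"

count = 0
parser = REXML::Parsers::SAX2Parser.new(xml)
parser.listen(:start_element) do |uri, localname, qname, attributes|
  count += 1 if localname == "item"
end
parser.parse
# count is now 2
```

For a real 30MB file you would pass an open File object instead of a string, so the document never has to be loaded whole.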

Fred

Quoting PierreW <wamrewam@googlemail.com>:

> - is it normal that parsing a 30MB file takes 700MB of memory? Could
> it be that something is wrong with the file? Is there an alternative
> way to deal with such big files?
> - is there a way to force the release of the memory when I don't need
> the file anymore? At the moment it is not released instantly after the
> first file, so I end up with 1.5GB memory use.

Generally XML libraries keep the whole content, including whitespace, in an
easily searchable tree structure, often plus the original text, plus
overhead.

> I have reduced the code to the minimum to isolate the memory issue:
>
> xml = File.read("myfile.xml")
> doc = REXML::Document.new(xml)   # or: doc = Hpricot.XML(xml)
> doc = nil
>
> and repeat with the second file.
>
> Also, I tried libxml just in case. I get an error message that I can't
> explain:
> LibXML::XML::Error (Fatal error: Input is not proper UTF-8, indicate
> encoding !)
> Yet the file is UTF-8 as far as I can tell.

LibXML is very picky about UTF-8, and I have not been able to figure out how
to get it to recover and continue parsing. Since Hpricot and Nokogiri are
less picky and they use LibXML, I presume it is possible.

Whenever I have dug into the source file in question, there has been a
non-UTF-8 character. Look at the reader.line_number and reader.column_number
values to find where it is.
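A quick way to locate such a byte before handing the file to LibXML is Ruby's String#valid_encoding? (a sketch; the sample data below is a stand-in - on the real file you would iterate with File.foreach("myfile.xml")):

```ruby
# Find the first line that is not valid UTF-8. The "\xFF" byte here is an
# invented example of the kind of garbage that trips up LibXML.
data = "ok line\nbad \xFF byte\n"

bad_line = nil
data.each_line.with_index(1) do |line, lineno|
  unless line.dup.force_encoding("UTF-8").valid_encoding?
    bad_line = lineno
    break
  end
end
# bad_line is 2
```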

As another person has suggested, the SAX API does not keep the whole file or
DOM in memory, so it uses much less memory. Also look at the XML::Reader
interface: it is very fast and not at all memory hungry. Your code will
probably not be as pretty as with the DOM APIs, but sometimes it is worth
the trade-offs. Switching from FeedTools to XML::Reader for reading RSS
feeds, grabbing just what my app needs, resulted in a speedup of better than
10x, maybe 100x. This is especially applicable if your code handles multiple
XML schemas (there are at least six different RSS schemas with varying
interpretations, and for some fields the whole DOM is searched 10 times; a
single pass with custom code is ugly, but worth it in my application).

HTH,
  Jeffrey

Hi Pierre,

I had a 45-50MB file to parse using Ruby libraries, but to no avail:
the DOM-based libraries were painfully slow, and the SAX-based one that
I tried (libxml-ruby) had some serious memory leaks. Now there's this
SAX Machine from Paul Dix that looks usable -
http://www.pauldix.net/2009/01/sax-machine-sax-parsing-made-easy.html

As for my problem, I ended up writing a StAX-based parser in Java to get
it to run in reasonable time :(

Hi all,

Thank you so much for pointing me in the right direction.

I used a REXML SAX2Parser: it solved my problem. It's a bit more code
indeed, but it uses a fraction of the memory and it seems quite fast
to me.
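For reference, the core of it looks something like this (not my exact code, just a minimal sketch - the XML and element names are invented, and StringIO stands in for the real file):

```ruby
require 'rexml/parsers/sax2parser'
require 'stringio'

# Stream from an IO object instead of File.read, so the document is never
# held in one big Ruby string. StringIO stands in for File.open("myfile.xml").
io = StringIO.new("<root><a/><b/></root>")

names = []
parser = REXML::Parsers::SAX2Parser.new(io)
parser.listen(:start_element) { |uri, localname, qname, attrs| names << localname }
parser.parse
# names is ["root", "a", "b"]
```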

Thanks a lot,
Pierre