Parsing big XML files - memory issue

Hello,

I need to parse two big XML files in a row (30+MB each). I have tried both REXML and Hpricot. They do work. Thing is, with both libraries, the parsing of each file takes a huge amount of memory: more than 700MB each!

So I was wondering:
- is it normal that parsing a 30MB file takes 700MB of memory? Could it be that something is wrong with the file? Is there an alternative way to deal with such big files?
- is there a way to force the release of the memory when I don't need the file anymore? At the moment it is not released right after the first file, so I end up with 1.5GB of memory use.

I have reduced the code to the minimum to isolate the memory issue:

xml = File.read("myfile.xml")
doc = REXML::Document.new(xml) # or doc = Hpricot.XML(xml)
doc = nil

and repeat with the second file.
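The closest thing I know of to a forced release is dropping every reference and hinting the GC, roughly as below; note that GC.start is only a hint, and MRI may not hand freed pages back to the OS right away anyway:

```ruby
require 'rexml/document'

# Sketch of releasing a parsed document: drop every reference
# (the raw string as well as the tree) and hint the GC.
xml = "<root><a/></root>"   # stand-in for File.read("myfile.xml")
doc = REXML::Document.new(xml)
xml = nil                   # drop the raw string too
doc = nil
GC.start                    # a hint only, not a guarantee
```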

Also, I tried libxml just in case. I get an error message that I can't explain: LibXML::XML::Error (Fatal error: Input is not proper UTF-8, indicate encoding!), yet the file is UTF-8 as far as I can tell.

Thanks a lot for your help. Pierre

> Hello,
>
> I need to parse two big XML files in a row (30+MB each). I have tried both REXML and Hpricot. They do work. Thing is, with both libraries, the parsing of each file takes a huge amount of memory: more than 700MB each!
>
> So I was wondering:
> - is it normal that parsing a 30MB file takes 700MB of memory? Could it be that something is wrong with the file? Is there an alternative way to deal with such big files?

DOM parsers can use up a lot of memory with large files (10x the file size or more). SAX parsers don't, because they don't keep the whole thing in memory: they just fire events as they traverse the document. REXML does have a SAX-style parser, and libxml has one too.
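A minimal sketch of the stream style with REXML's StreamListener (the element name and input are invented for illustration): the listener gets a callback per tag and nothing is kept in memory beyond what you store yourself.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Counts <item> start tags as they stream past, without ever
# building a document tree.
class ItemCounter
  include REXML::StreamListener
  attr_reader :count

  def initialize
    @count = 0
  end

  def tag_start(name, _attrs)
    @count += 1 if name == "item"
  end
end

listener = ItemCounter.new
REXML::Document.parse_stream("<root><item/><item/><item/></root>", listener)
puts listener.count  # => 3
```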

Fred

Quoting PierreW <wamrewam@googlemail.com>:

> Hello,
>
> I need to parse two big XML files in a row (30+MB each). I have tried both REXML and Hpricot. They do work. Thing is, with both libraries, the parsing of each file takes a huge amount of memory: more than 700MB each!
>
> So I was wondering:
> - is it normal that parsing a 30MB file takes 700MB of memory? Could it be that something is wrong with the file? Is there an alternative way to deal with such big files?
> - is there a way to force the release of the memory when I don't need the file anymore? At the moment it is not released right after the first file, so I end up with 1.5GB of memory use.

Generally XML libraries keep the whole content, including whitespace, in an easily searchable tree data structure, often alongside the original text, plus overhead.

> I have reduced the code to the minimum to isolate the memory issue:
>
> xml = File.read("myfile.xml")
> doc = REXML::Document.new(xml) # or doc = Hpricot.XML(xml)
> doc = nil
>
> and repeat with the second file.
>
> Also, I tried libxml just in case. I get an error message that I can't explain: LibXML::XML::Error (Fatal error: Input is not proper UTF-8, indicate encoding!), yet the file is UTF-8 as far as I can tell.

LibXML is very picky about UTF-8 and I have not been able to figure out how to get it to recover and continue parsing. Since Hpricot and Nokogiri are less picky, and Nokogiri is built on libxml2, I presume it is possible.

Whenever I have dug into the source file in question, there has been a non-UTF-8 character. Check the reader.line_number and reader.column_number values to find where.

As another person has suggested, the SAX API does not keep the whole file or DOM in memory, so it uses much less memory. Also look at the XML::Reader interface; it is very fast and not at all memory hungry. Your code will probably not be as pretty as with the DOM APIs, but sometimes the trade-off is worth it. Switching from FeedTools to XML::Reader to grab just what my app needs from RSS feeds gave a speed-up of better than 10x, maybe 100x. This applies especially if your code handles multiple XML schemas: there are at least 6 different RSS schemas with varying interpretations, and for some fields the whole DOM is searched 10 times. A single pass with custom code is ugly, but worth it in my application.
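If libxml-ruby isn't available, Ruby's standard library has an analogous pull interface, REXML::Parsers::PullParser, with the same read-one-event-at-a-time shape as XML::Reader; a minimal sketch with an invented feed snippet:

```ruby
require 'rexml/parsers/pullparser'

# Pull events one at a time and keep only the <title> text;
# nothing else is retained in memory.
parser = REXML::Parsers::PullParser.new(
  "<feed><title>First</title><item/><title>Second</title></feed>"
)
titles = []
while parser.has_next?
  event = parser.pull
  if event.start_element? && event[0] == "title"
    text = parser.pull
    titles << text[0] if text.text?
  end
end
puts titles.inspect  # => ["First", "Second"]
```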

HTH,   Jeffrey

Hi Pierre,

I had a 45~50MB file to parse and tried the Ruby libraries to no avail: the DOM-based ones were painfully slow, and the SAX-based one I tried (libxml-ruby) had some serious memory leaks. Now there's this SAX Machine from Paul Dix that looks usable - http://www.pauldix.net/2009/01/sax-machine-sax-parsing-made-easy.html

As for my problem, I ended up writing a StAX-based parser in Java to get it to run in reasonable time :(

Hi all,

Thank you so much for pointing me in the right direction.

I used a REXML SAX2Parser and it solved my problem. It's a bit more code indeed, but it uses a fraction of the memory and it seems quite fast to me.
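For anyone finding this thread later, the shape of what I did, reduced to a sketch (the element and attribute names here are invented for the example): register callbacks instead of building a tree, then parse.

```ruby
require 'rexml/parsers/sax2parser'

# Collect the id attribute of every <item> as the parser
# streams through the document; no DOM is ever built.
parser = REXML::Parsers::SAX2Parser.new(
  '<root><item id="1"/><item id="2"/></root>'
)
ids = []
parser.listen(:start_element) do |_uri, localname, _qname, attrs|
  ids << attrs["id"] if localname == "item"
end
parser.parse
puts ids.inspect  # => ["1", "2"]
```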

Thanks a lot, Pierre