help with xml -> database insertion

Hello,

I'm grabbing a few xml documents, representing 100 - 300 database
rows in total, parsing them, and inserting them into my mysql
database. I'm using rexml/document, parsing each entire xml document
(some are 100k), and iterating through them, line by line, and putting
them into my database, line by line.

It is painfully slow.

What is the best strategy for grabbing xml, parsing it, and getting it
into my database?

Charlie

Use a faster parser? hpricot comes to mind, part of it is written in
C. (It styles itself an HTML parser, but it should be able to do what
you want).

Vish

charlie caroff wrote:

Hello,

I'm grabbing a few xml documents, representing 100 - 300 database
rows in total, parsing them, and inserting them into my mysql
database. I'm using rexml/document, parsing each entire xml document
(some are 100k), and iterating through them, line by line, and putting
them into my database, line by line.

It is painfully slow.

What is the best strategy for grabbing xml, parsing it, and getting it
into my database?

Charlie

How slow? In my opinion (I've used XML in Ruby rather than in Rails), you should do one of the following:
1. Switch to C for the parsing. You can get it to dump out something like a CSV (and then don't use the built-in CSV parser - use File and split to parse that)
2. Do not load the whole document into memory. Use the stream parser mode instead. I have a 5MB XML file with about 50,000 records and it takes about 6 minutes to load on my laptop before I can do anything with it. Also, my Ruby process' memory usage increases to about 450MB. In stream parsing mode, it is much faster. This may or may not be fast enough for you, so try this (it's easy to do) and then switch to #1 if it doesn't help enough.

Hope this helps.

Cheers,
Mohit.
8/13/2007 | 9:13 AM.

Thanks. What about getting the data into my database? Is it best to
do it line by line as the stream comes in?

Charlie

charlie caroff wrote:

Thanks. What about getting the data into my database? Is it best to
do it line by line as the stream comes in?

Charlie
  
Hi Charlie,

I'm not doing what you are working on. Is this a 1-time task, something that happens every now and then, or something that is a frequent occurrence? Is it done in the backend or is it done by the user uploading an XML file?

Basically, if you're doing it line by line, it's simple - read a line, break up the data into individual parameters, create an ActiveRecord object with those parameters, save it to the database. Repeat!

I'm not sure what kind of performance you will get - depending on how frequently you need to do this (and what the response time needs to be), this may meet your needs. If not, you could use a number of ways to handle it, including using a module in some other language, if required.

Cheers,
Mohit.
8/13/2007 | 11:29 PM.

Using a single insert is obviously faster. AR doesn't have any inbuilt
support for this so you might have to write custom SQL in the model.

http://www.desilva.biz/mysql/insert.html

Vish