ActiveRecord: ar-each_model

Hi,

today I ran into a problem when I had to iterate over a big result
set. ActiveRecord produces a huge array of model instances, which
consumed 1 GB of memory on my machine. A comparable Perl DBI script
used only a few KB to iterate over the result set.

Then it was suggested that I use batching (#find_each) for the
problem. But batching basically does the same thing: it splits the
query into several queries, each of which returns only 1000 model
instances in an array. That still consumes more memory than necessary,
though not nearly as much. The problem: it was much slower on my
legacy database (it took 25 minutes, while the version without
batching took 90 seconds and the Perl script took only 40 seconds).
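
To make the batching behaviour concrete, here is a minimal sketch in
plain Ruby. It is not ActiveRecord's code: TABLE, fetch_batch and
each_in_batches are hypothetical names, and the "query" is simulated
against an in-memory array. It only illustrates that each batch is a
separate query, keyed on the primary key.

```ruby
# Hypothetical in-memory "table" of row hashes, ordered by primary key.
TABLE = (1..2_500).map { |id| { id: id, message: "event #{id}" } }.freeze

# Simulates one query: WHERE id > last_id ORDER BY id LIMIT batch_size.
def fetch_batch(last_id, batch_size)
  TABLE.select { |row| row[:id] > last_id }.first(batch_size)
end

# Yields every row, issuing one simulated query per batch.
# Returns the number of queries that were needed.
def each_in_batches(batch_size: 1000)
  queries = 0
  last_id = 0
  loop do
    batch = fetch_batch(last_id, batch_size)
    queries += 1
    break if batch.empty?
    batch.each { |row| yield row }
    # A short batch means the table is exhausted.
    break if batch.size < batch_size
    last_id = batch.last[:id]
  end
  queries
end

seen = 0
queries = each_in_batches(batch_size: 1000) { |_row| seen += 1 }
puts "#{seen} rows in #{queries} queries"  # prints "2500 rows in 3 queries"
```

Only one batch is held in memory at a time, but the query cost is paid
once per batch rather than once overall.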

So I searched for a method that yields one model instance at a time
from the result set of the query, but I could not find anything.

In the end, I tried to find a solution myself and came up with this
simple one:

https://github.com/tvw/ar-each_model

I want to put it up for discussion here, since I wonder why nothing
similar exists in ActiveRecord.

Regards
Thomas

tvw wrote in post #1040336:

> today I ran into a problem when I had to iterate over a big result
> set. ActiveRecord produces a huge array of model instances, which
> consumed 1 GB of memory on my machine. A comparable Perl DBI script
> used only a few KB to iterate over the result set.

It is well known that ActiveRecord objects are pretty heavyweight. My
guess is that your Perl script is instead returning something very
lightweight, little more than key-value pairs for each database row.

I'm also assuming that whatever you're doing with this large result set
probably doesn't require "smart" heavyweight model objects.

In the past I used a framework that had a similar issue. It, however,
had a built-in mechanism for dealing with the issue. It had something
called "raw row fetching." Rather than returning true model objects,
with all the intelligence built into them, you could opt to fetch raw
rows that were represented by an array of dictionaries (hashes). A raw
row could then be transformed into a full-fledged model object on
demand.

I don't know if ActiveRecord provides anything similar to this
out-of-the-box. But, I'm sure someone must have developed something like
this for Rails.

> Hi,

> today I ran into a problem when I had to iterate over a big result
> set. ActiveRecord produces a huge array of model instances, which
> consumed 1 GB of memory on my machine. A comparable Perl DBI script
> used only a few KB to iterate over the result set.

How many records was this? 1 GB is pretty crazy.

> Then it was suggested that I use batching (#find_each) for the
> problem. But batching basically does the same thing: it splits the
> query into several queries, each of which returns only 1000 model
> instances in an array. That still consumes more memory than necessary,
> though not nearly as much. The problem: it was much slower on my
> legacy database (it took 25 minutes, while the version without
> batching took 90 seconds and the Perl script took only 40 seconds).

This sounds like your legacy DB is somehow not indexing the ID column
- after all, instantiating all those objects only takes 90 seconds.

--Matt Jones

Hi Robert,

> tvw wrote in post #1040336:

> > today I ran into a problem when I had to iterate over a big result
> > set. ActiveRecord produces a huge array of model instances, which
> > consumed 1 GB of memory on my machine. A comparable Perl DBI script
> > used only a few KB to iterate over the result set.

> It is well known that ActiveRecord objects are pretty heavyweight. My
> guess is that your Perl script is instead returning something very
> lightweight, little more than key-value pairs for each database row.

ActiveRecord objects are not nearly that heavyweight, but it is true
that the Perl script returns just a hash for each row. If I got only
one ActiveRecord object per row at a time, I would not have looked for
a better solution. But ActiveRecord builds an entire array containing
all records of the query as ActiveRecord objects before it yields the
first object in the loop. If the Perl script produced an array of
hashes, which is the lightweight counterpart, it would consume a very
noticeable amount of memory too, though not as much as ActiveRecord.
But in the Perl script, there is never more than one hash in existence
at a time: the last row retrieved from the database.

> I'm also assuming that whatever you're doing with this large result set
> probably doesn't require "smart" heavyweight model objects.

Yes, that is true, but dealing with the objects is more comfortable,
since they are smart, and with the solution I found it costs me only a
few bytes and a little more time.

> In the past I used a framework that had a similar issue. It, however,
> had a built-in mechanism for dealing with the issue. It had something
> called "raw row fetching." Rather than returning true model objects,
> with all the intelligence built into them, you could opt to fetch raw
> rows that were represented by an array of dictionaries (hashes). A raw
> row could then be transformed into a full-fledged model object on
> demand.

> I don't know if ActiveRecord provides anything similar to this
> out-of-the-box. But, I'm sure someone must have developed something like
> this for Rails.

The ActiveRecord database adapters provide such a mechanism, and that
is what my solution uses: it retrieves the rows from the database as
hashes and generates an ActiveRecord object from each one, which it
then yields. This costs just a few bytes of overhead over raw hashes
and a little more time.
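
The idea can be sketched in plain Ruby as follows. This is not the
code from the ar-each_model repository: the Event class, raw_rows and
the signature of each_model used here are hypothetical stand-ins, and
an Enumerator plays the role of the driver cursor so that the example
runs without a database.

```ruby
# Hypothetical model: a thin wrapper around one row hash.
class Event
  attr_reader :attributes

  def initialize(attributes)
    @attributes = attributes
  end

  def id
    attributes["id"]
  end

  def message
    attributes["message"]
  end

  # Stand-in for a driver cursor: produces one row hash at a time
  # instead of building an array of all rows up front.
  def self.raw_rows(count)
    Enumerator.new do |out|
      1.upto(count) { |i| out << { "id" => i, "message" => "event #{i}" } }
    end
  end

  # Wraps each row hash in a model object and yields it immediately,
  # so no array of model instances is ever built.
  def self.each_model(count)
    return enum_for(:each_model, count) unless block_given?
    raw_rows(count).each { |row| yield new(row) }
  end
end

total = 0
Event.each_model(1_000) { |event| total += event.id }
puts total  # prints 500500
```

Each model object becomes garbage as soon as the block returns, which
is why the memory footprint stays close to that of iterating over raw
hashes.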

The funny thing was that this solution worked as expected with my
sqlite3 demo database, while against my legacy SQL Server database it
consumed a lot of memory too. The reason was that the underlying
TinyTDS database driver, which the sqlserver adapter for ActiveRecord
uses, caches each record it retrieves by default, and this caching
must be turned off.

So when you use the ActiveRecord adapter, you not only have the memory
consumption that results from AR building an array of objects. On top
of this, you have a cache of all database rows in the driver too,
since the adapter uses the default option, as far as I can see.

Regards
Thomas

> > Hi,

> > today I ran into a problem when I had to iterate over a big result
> > set. ActiveRecord produces a huge array of model instances, which
> > consumed 1 GB of memory on my machine. A comparable Perl DBI script
> > used only a few KB to iterate over the result set.

> How many records was this? 1 GB is pretty crazy.

400000 records

> > Then it was suggested that I use batching (#find_each) for the
> > problem. But batching basically does the same thing: it splits the
> > query into several queries, each of which returns only 1000 model
> > instances in an array. That still consumes more memory than necessary,
> > though not nearly as much. The problem: it was much slower on my
> > legacy database (it took 25 minutes, while the version without
> > batching took 90 seconds and the Perl script took only 40 seconds).

> This sounds like your legacy DB is somehow not indexing the ID column
> - after all, instantiating all those objects only takes 90 seconds.

Matt, the ID column is indexed, but probably not all the fields the
query uses, since the table is used for logging events and must be
fast for writing rather than reading. The 400_000 records I retrieve
are among millions of records. They lie close to each other, but not
next to each other. And the database is under heavy production load
while I read the data.

So some of the 40 seconds the Perl script runs is the cost of the
query itself. If you now do 400 queries in batches rather than 1 query
to retrieve all records, and each query costs you only 1 second, you
have already spent 400 seconds, which is about 7 minutes, without
retrieving and processing a single row.
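
The estimate can be written out as a small calculation. The 1 second
per query is the assumed figure from the paragraph above, not a
measurement, and the batch size of 1000 is the one mentioned for
#find_each earlier in the thread.

```ruby
records           = 400_000
batch_size        = 1_000
seconds_per_query = 1.0   # assumption: fixed cost of running the query once

# One query per batch, rounded up for a partial last batch.
queries          = (records.to_f / batch_size).ceil
overhead_seconds = queries * seconds_per_query
overhead_minutes = (overhead_seconds / 60).round(1)

puts "#{queries} queries, #{overhead_seconds} s (about #{overhead_minutes} min) of query overhead"
# prints "400 queries, 400.0 s (about 6.7 min) of query overhead"
```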

Regards
Thomas