2 ActiveRecord features I can offer: eager loading, optimized updates

Hey all,

So there are a couple features that my team would find useful in
ActiveRecord, that I can volunteer our time to work on. I mainly want
to see if there's going to be support in the Rails core team to these
changes into core before doing a full dev cycle on them.

Please let me know if you support these ideas. If so, my team can work
on implementing them.

1) Eager association loading

Currently, if you have a model like this:

   class Forum < AR
     has_many :posts
   end
   class Post < AR
     belongs_to :forum
   end

Then walk thru these records:

   <% for forum in @forums %>
      Title: <%= forum.post.subject %>
   <% end %>

Currently, this will do an extra SQL select * from post for each loop.

DataMapper (datamapper.org) has an interesting approach: The first
time that a sub-association is loaded, then all associations are
preloaded, since most times you're in a loop (as above). While
the :include option helps with this somewhat, when passing collection
objects around, you often don't know how the associations are going to
be used.

2) Optimized updates

Currently, if one attribute is changed in an AR model, then the entire
model is parsed and sent as an UPDATE statement. This is not really
desirable from a network or database standpoint.

There are already some counters in AR for monitoring changes. Is
there a per-attribute @changed data structure? It seems like this
could just be set to true when updating a column. Then, on save, only
those with the @changed flag are actually converted to SQL.

These are changes we really need for our local copy anyways.
Thoughts?

Thanks,
Nate

My forum example was terrible. Here's what I meant:

   <% for post in @posts %>
      Title: <%= post.forum.title %>
   <% end %>

Not a perfect example, but you get the idea. Ideally, you'd do one
bulk select, similar to how :include works, on the first missing
record

   select * from forums where id in (12, 157, 2 , 34)

You would get the id list from a distinct list of all values in
post.forum_id

-Nate

Currently, this will do an extra SQL select * from post for each loop.

DataMapper (datamapper.org) has an interesting approach: The first
time that a sub-association is loaded, then all associations are
preloaded, since most times you're in a loop (as above). While
the :include option helps with this somewhat, when passing collection
objects around, you often don't know how the associations are going to
be used.

There's certainly value to be found here. I've definitely been in the case where one consumer of a method or an association would quite like a whole bunch of things included, but the other wouldn't.

2) Optimized updates

Currently, if one attribute is changed in an AR model, then the entire
model is parsed and sent as an UPDATE statement. This is not really
desirable from a network or database standpoint.

There are already some counters in AR for monitoring changes. Is
there a per-attribute @changed data structure? It seems like this
could just be set to true when updating a column. Then, on save, only
those with the @changed flag are actually converted to SQL.

I can certainly see where you're coming from and something similar has often crossed my mind. The big issue I have with this one is consistency:
Currently I can look at an object, call save! and I know that what's in the db is what I just saw, whereas if you only changed the attributes you touched then you can't be sure of this. There's cases where the right thing to do is probably just to use locking of one form or another, but I can't help but feel that's a bit overkill for the simple case.

Fred

+1 for this, be sure to read Pratik’s writeup http://m.onkey.org/2007/12/18/partial-updates

It couldn’t replace current behavior right away (Frederick explained why), but it could have its own option (or even a method).

These should, in my opinion, do partial updates by default: update_attribute, update_attributes.

Agreed, especially update_attribute. The name just screams "I'm going to change one attribute" but then it resaves the whole object.

Fred

I don't think we'd want to change this default unconditionally.
Perhaps update_attribute(s) should do a partial update iff the table
had a lock attribute, or the model otherwise declared the desire for
partial update.

Interesting post by Pratik. I don't know how much real-world impact
it would really have, though. I've never run into the situation where
I'm save()'ing a record, followed by doing lots more stuff with it.
Normally you redirect and go to another resource.

Nevertheless, I understand why we couldn't change the default
behavior.

For both of these features, I'd recommend AR::B class settings so the
user can control them:

   ActiveRecord::Base.eager_loading = true
   ActiveRecord::Base.partial_updates = true

That way, it could be set in environment.rb, and overridden on a per-
model basis:

   environment.rb
     ActiveRecord::Base.eager_loading = true
     ActiveRecord::Base.partial_updates = true

   class CouldHaveInconsistency < AR::Base
      partial_updates = false
   end

Finally, if a user wanted to do optimistic locking to better ensure
their partial_updates are ok, then they would do that simply by adding
lock_version as is currently. However, I think the two pieces of
functionality should be kept separate, so the user can choose.

Sound good?

Finally, if a user wanted to do optimistic locking to better ensure
their partial_updates are ok, then they would do that simply by adding
lock_version as is currently. However, I think the two pieces of
functionality should be kept separate, so the user can choose.

Sound good?

This has come up a bunch of times here and doesn't tend to get
resolved. In hibernate this feature is called 'dynamic-update', and
you can have a look at their forums etc to see the potential issues
that people hit. It's worth noting that this feature requires
optimistic locking.

I'm certainly interested in seeing patches which implement partial
updates like this, but it's simply not safe to use unless the object
is locked in some way. Otherwise you have *no* way to know that the
values in a row correspond to a valid object. This isn't something
that rails should allow, the risks of corrupted / invalid data FAR
outweigh the difficulty of adding a locking column.

1) Eager association loading

Currently, if you have a model like this:

   class Forum < AR
     has_many :posts
   end
   class Post < AR
     belongs_to :forum
   end

Then walk thru these records:

   <% for forum in @forums %>
      Title: <%= forum.post.subject %>
   <% end %>

Currently, this will do an extra SQL select * from post for each loop.

That's not completely true true. ActiveRecord has basic query
caching, so (in your example), each forum would only be loaded once.
There's been interest in porting my active_record_context plugin into
ActiveRecord though. I have a writeup on it at
http://activereload.net/2007/5/23/spend-less-time-in-the-database-and-more-time-outdoors.
It's basically a simpler, more manual implementation of DM's
association eager loading. There's been some work done to the eager
including functionality that makes my plugin less useful, but
something more automatic would be welcome.

2) Optimized updates

Currently, if one attribute is changed in an AR model, then the entire
model is parsed and sent as an UPDATE statement. This is not really
desirable from a network or database standpoint.

Jeremy (bitsweat) has a plugin for tracking updates to ActiveRecord
attributes: http://code.bitsweat.net/svn/dirty/README It sounds like
this will be going into core. With that, it'd make it pretty easy to
perform dynamic updates. Koz's concerns are definitely valid
though...

DataMapper (datamapper.org) has an interesting approach: The first
time that a sub-association is loaded, then all associations are
preloaded, since most times you're in a loop (as above). While
the :include option helps with this somewhat, when passing collection
objects around, you often don't know how the associations are going to
be used.

Delayed execution is nice. Definitely worth exploring.

2) Optimized updates

Jeremy Kemper has a dirty tracking plugin that's going into Active
Record shortly. That'll solve the first part of the problem (knowing
what changed). The second part, just saving what changed, is a little
bit more tricky. Records can easily have interdependent fields where a
record is only valid if the two fields together fulfill some
requirement. If you save only changed fields, you risk ending up with
an invalid object. Locking can help mitigate this, but it's something
that needs to be considered as part of the implementation.

> 2) Optimized updates

Jeremy Kemper has a dirty tracking plugin that's going into Active
Record shortly. That'll solve the first part of the problem (knowing
what changed). The second part, just saving what changed, is a little
bit more tricky. Records can easily have interdependent fields where a
record is only valid if the two fields together fulfill some
requirement. If you save only changed fields, you risk ending up with
an invalid object. Locking can help mitigate this, but it's something
that needs to be considered as part of the implementation.

Actually, I'm coming around on this. The common case is that you don't
have validation concerns that implicate multiple attributes. The
common case is that it's nice if someone updating the title of a post
won't overwrite the separate changes made to the body by someone else.
We should certainly make it possible and easy to enforce full-
integrity writes, but I think the work should lie with people doing
that.

Naturally, we can't just change this as existing implementations may
break. But I think we should make partial updates the default for apps
created with 2.1. And then finally the default for all by 3.0. We can
do this by making an option like config.active_record.partial_updates
= true that's only set for apps created post-2.1 and when that option
is not set, we complain in the log about the deprecated status.

This should ensure that old apps continue to work just fine and can
upgrade to the partial-update style at their leisure while everything
new enjoys the revised style.

Worth noting is that the dirty tracking that facilitates this relies
on no changes in place. So post.title << "more" won't trigger a dirty
mark and thus won't include post.title in an update. Which is another
reason to make sure that existing apps don't get the partial updates
before they're ready for it.

Naturally, we can't just change this as existing implementations may
break. But I think we should make partial updates the default for apps
created with 2.1. And then finally the default for all by 3.0. We can
do this by making an option like config.active_record.partial_updates
= true that's only set for apps created post-2.1 and when that option
is not set, we complain in the log about the deprecated status.

I think an ever better and safer option is to add optimistic locking
based on updated_at. Most models have this now, and it's a useful
field (compared to an opaque incrementing integer). That way we can
enable it by default, and not worry about busted validations.

If we really want to we can then add a "I know what I'm doing, do
partial updates without a lock" option, but the fact that we'd be the
first orm to do this makes me question the sanity of even considering
that option :).

Uniqueness of timestamps is not a safe assumption. It will find a way
to bite you, eventually.

> I think an ever better and safer option is to add optimistic locking
> based on updated_at.

Uniqueness of timestamps is not a safe assumption. It will find a way
to bite you, eventually.

Particularly fun when you get the code working smoothly on the dev
box, only to fail randomly on more expensive and faster deployment
server.

Version numbers work better, handled by optimistic locking, and a
cheap way to calculate ETag on your models.

Assaf

This is certainly a risk, but it's much less likely than it seems at
first. The particular values of the timestamp column are just opaque
keys used for checking no one else has updated it. I certainly agree
that a version number is the way to go, but if we wanted to push
partial updates out more widely, this would be a low risk way of doing
it.

In distant, but painful memory, it turned out to be *more* likely than
it seemed at first. Likely enough, in that particular scenario, that
it actually happened once in a couple of weeks. And by virtue of being
unreproducible, this bug caused a lot of hurt.

Given that the entire problem that optimistic locking is trying to solve is multiple people fiddling with something at the same time you'd think the odds of a 'collision' would be relatively likely when there were indeed multiple people touching the row in the database at close together times.

Fred

Nate Wiger wrote:

2) Optimized updates

Currently, if one attribute is changed in an AR model, then the entire
model is parsed and sent as an UPDATE statement. This is not really
desirable from a network or database standpoint.

There are already some counters in AR for monitoring changes. Is
there a per-attribute @changed data structure? It seems like this
could just be set to true when updating a column. Then, on save, only
those with the @changed flag are actually converted to SQL.

It isn't quite the solution to the problem, just a hack around it, but
it solves the problem of saving just the pieces of a model I need. See:

http://lists.rubyonrails.org/pipermail/rails/2006-May/039499.html

But that didn't work for me exactly. I had to include some of the model
associations and some of the columns that require validations, like
"enforces_length_of :title" for instance. Adding that to the :select =>
[] and doing an update_attribute instead of a object.save successfully
saved just want I wanted to without saving the entire model object.

position = params[:position]
if section = @stuff.sections.find(params[:id], :select => 'sections.id,
sections.bogus_id, sections.title, sections.position')
  section.update_attribute(:position, position) if section.position !=
position
end

This is what I ended up with. All of these objects are owned and edited
by individual users of the system so I think its safe to assume sure
there won't be any concurrency issues.

Brian Mcquay wrote:

It isn't quite the solution to the problem, just a hack around it, but
it solves the problem of saving just the pieces of a model I need. See:
...

I posted my previous comment prematurely. Upon further testing it isn't
working properly. Please disregard my previous comments.

>> This is certainly a risk, but it's much less likely than it seems at
>> first.

> In distant, but painful memory, it turned out to be *more* likely than
> it seemed at first. Likely enough, in that particular scenario, that
> it actually happened once in a couple of weeks. And by virtue of being
> unreproducible, this bug caused a lot of hurt.

Given that the entire problem that optimistic locking is trying to
solve is multiple people fiddling with something at the same time
you'd think the odds of a 'collision' would be relatively likely when
there were indeed multiple people touching the row in the database at
close together times.

Fred

I've never heard it put so well. I'm going to steal this if you don't
mind. :wink:

@DHH:

Worth noting is that the dirty tracking that facilitates this relies
on no changes in place. So post.title << "more" won't trigger a dirty
mark and thus won't include post.title in an update. Which is another
reason to make sure that existing apps don't get the partial updates
before they're ready for it.

That's pretty drastic (in a good way :wink: ). Ruby would be a nicer
place (sometimes) with the option of cheap, immutable Strings of
course. I'm in the same boat... there's several options, and none of
them nice. Original-value comparison (memory requirements go up, slows
down having to #dup everything), immutable conventions (not "safe"),
writing your own immutable string (what a rabbit-hole!)?

If you can manage to persuade the community on ignoring the mutable
methods by convention, I'll be happy to follow your lead... :smiley: