2 ActiveRecord features I can offer: eager loading, optimized updates

Hey all,

So there are a couple features that my team would find useful in ActiveRecord, that I can volunteer our time to work on. I mainly want to see if there's going to be support in the Rails core team to these changes into core before doing a full dev cycle on them.

Please let me know if you support these ideas. If so, my team can work on implementing them.

1) Eager association loading

Currently, if you have a model like this:

   class Forum < AR      has_many :posts    end    class Post < AR      belongs_to :forum    end

Then walk thru these records:

   <% for forum in @forums %>       Title: <%= forum.post.subject %>    <% end %>

Currently, this will do an extra SQL select * from post for each loop.

DataMapper (datamapper.org) has an interesting approach: The first time that a sub-association is loaded, then all associations are preloaded, since most times you're in a loop (as above). While the :include option helps with this somewhat, when passing collection objects around, you often don't know how the associations are going to be used.

2) Optimized updates

Currently, if one attribute is changed in an AR model, then the entire model is parsed and sent as an UPDATE statement. This is not really desirable from a network or database standpoint.

There are already some counters in AR for monitoring changes. Is there a per-attribute @changed data structure? It seems like this could just be set to true when updating a column. Then, on save, only those with the @changed flag are actually converted to SQL.

These are changes we really need for our local copy anyways. Thoughts?

Thanks, Nate

My forum example was terrible. Here's what I meant:

   <% for post in @posts %>       Title: <%= post.forum.title %>    <% end %>

Not a perfect example, but you get the idea. Ideally, you'd do one bulk select, similar to how :include works, on the first missing record

   select * from forums where id in (12, 157, 2 , 34)

You would get the id list from a distinct list of all values in post.forum_id

-Nate

Currently, this will do an extra SQL select * from post for each loop.

DataMapper (datamapper.org) has an interesting approach: The first time that a sub-association is loaded, then all associations are preloaded, since most times you're in a loop (as above). While the :include option helps with this somewhat, when passing collection objects around, you often don't know how the associations are going to be used.

There's certainly value to be found here. I've definitely been in the case where one consumer of a method or an association would quite like a whole bunch of things included, but the other wouldn't.

2) Optimized updates

Currently, if one attribute is changed in an AR model, then the entire model is parsed and sent as an UPDATE statement. This is not really desirable from a network or database standpoint.

There are already some counters in AR for monitoring changes. Is there a per-attribute @changed data structure? It seems like this could just be set to true when updating a column. Then, on save, only those with the @changed flag are actually converted to SQL.

I can certainly see where you're coming from and something similar has often crossed my mind. The big issue I have with this one is consistency: Currently I can look at an object, call save! and I know that what's in the db is what I just saw, whereas if you only changed the attributes you touched then you can't be sure of this. There's cases where the right thing to do is probably just to use locking of one form or another, but I can't help but feel that's a bit overkill for the simple case.

Fred

+1 for this, be sure to read Pratik’s writeup http://m.onkey.org/2007/12/18/partial-updates

It couldn’t replace current behavior right away (Frederick explained why), but it could have its own option (or even a method).

These should, in my opinion, do partial updates by default: update_attribute, update_attributes.

Agreed, especially update_attribute. The name just screams "I'm going to change one attribute" but then it resaves the whole object.

Fred

I don't think we'd want to change this default unconditionally. Perhaps update_attribute(s) should do a partial update iff the table had a lock attribute, or the model otherwise declared the desire for partial update.

Interesting post by Pratik. I don't know how much real-world impact it would really have, though. I've never run into the situation where I'm save()'ing a record, followed by doing lots more stuff with it. Normally you redirect and go to another resource.

Nevertheless, I understand why we couldn't change the default behavior.

For both of these features, I'd recommend AR::B class settings so the user can control them:

   ActiveRecord::Base.eager_loading = true    ActiveRecord::Base.partial_updates = true

That way, it could be set in environment.rb, and overridden on a per- model basis:

   environment.rb      ActiveRecord::Base.eager_loading = true      ActiveRecord::Base.partial_updates = true

   class CouldHaveInconsistency < AR::Base       partial_updates = false    end

Finally, if a user wanted to do optimistic locking to better ensure their partial_updates are ok, then they would do that simply by adding lock_version as is currently. However, I think the two pieces of functionality should be kept separate, so the user can choose.

Sound good?

Finally, if a user wanted to do optimistic locking to better ensure their partial_updates are ok, then they would do that simply by adding lock_version as is currently. However, I think the two pieces of functionality should be kept separate, so the user can choose.

Sound good?

This has come up a bunch of times here and doesn't tend to get resolved. In hibernate this feature is called 'dynamic-update', and you can have a look at their forums etc to see the potential issues that people hit. It's worth noting that this feature requires optimistic locking.

I'm certainly interested in seeing patches which implement partial updates like this, but it's simply not safe to use unless the object is locked in some way. Otherwise you have *no* way to know that the values in a row correspond to a valid object. This isn't something that rails should allow, the risks of corrupted / invalid data FAR outweigh the difficulty of adding a locking column.

1) Eager association loading

Currently, if you have a model like this:

   class Forum < AR      has_many :posts    end    class Post < AR      belongs_to :forum    end

Then walk thru these records:

   <% for forum in @forums %>       Title: <%= forum.post.subject %>    <% end %>

Currently, this will do an extra SQL select * from post for each loop.

That's not completely true true. ActiveRecord has basic query caching, so (in your example), each forum would only be loaded once. There's been interest in porting my active_record_context plugin into ActiveRecord though. I have a writeup on it at http://activereload.net/2007/5/23/spend-less-time-in-the-database-and-more-time-outdoors. It's basically a simpler, more manual implementation of DM's association eager loading. There's been some work done to the eager including functionality that makes my plugin less useful, but something more automatic would be welcome.

2) Optimized updates

Currently, if one attribute is changed in an AR model, then the entire model is parsed and sent as an UPDATE statement. This is not really desirable from a network or database standpoint.

Jeremy (bitsweat) has a plugin for tracking updates to ActiveRecord attributes: http://code.bitsweat.net/svn/dirty/README It sounds like this will be going into core. With that, it'd make it pretty easy to perform dynamic updates. Koz's concerns are definitely valid though...

DataMapper (datamapper.org) has an interesting approach: The first time that a sub-association is loaded, then all associations are preloaded, since most times you're in a loop (as above). While the :include option helps with this somewhat, when passing collection objects around, you often don't know how the associations are going to be used.

Delayed execution is nice. Definitely worth exploring.

2) Optimized updates

Jeremy Kemper has a dirty tracking plugin that's going into Active Record shortly. That'll solve the first part of the problem (knowing what changed). The second part, just saving what changed, is a little bit more tricky. Records can easily have interdependent fields where a record is only valid if the two fields together fulfill some requirement. If you save only changed fields, you risk ending up with an invalid object. Locking can help mitigate this, but it's something that needs to be considered as part of the implementation.

> 2) Optimized updates

Jeremy Kemper has a dirty tracking plugin that's going into Active Record shortly. That'll solve the first part of the problem (knowing what changed). The second part, just saving what changed, is a little bit more tricky. Records can easily have interdependent fields where a record is only valid if the two fields together fulfill some requirement. If you save only changed fields, you risk ending up with an invalid object. Locking can help mitigate this, but it's something that needs to be considered as part of the implementation.

Actually, I'm coming around on this. The common case is that you don't have validation concerns that implicate multiple attributes. The common case is that it's nice if someone updating the title of a post won't overwrite the separate changes made to the body by someone else. We should certainly make it possible and easy to enforce full- integrity writes, but I think the work should lie with people doing that.

Naturally, we can't just change this as existing implementations may break. But I think we should make partial updates the default for apps created with 2.1. And then finally the default for all by 3.0. We can do this by making an option like config.active_record.partial_updates = true that's only set for apps created post-2.1 and when that option is not set, we complain in the log about the deprecated status.

This should ensure that old apps continue to work just fine and can upgrade to the partial-update style at their leisure while everything new enjoys the revised style.

Worth noting is that the dirty tracking that facilitates this relies on no changes in place. So post.title << "more" won't trigger a dirty mark and thus won't include post.title in an update. Which is another reason to make sure that existing apps don't get the partial updates before they're ready for it.

Naturally, we can't just change this as existing implementations may break. But I think we should make partial updates the default for apps created with 2.1. And then finally the default for all by 3.0. We can do this by making an option like config.active_record.partial_updates = true that's only set for apps created post-2.1 and when that option is not set, we complain in the log about the deprecated status.

I think an ever better and safer option is to add optimistic locking based on updated_at. Most models have this now, and it's a useful field (compared to an opaque incrementing integer). That way we can enable it by default, and not worry about busted validations.

If we really want to we can then add a "I know what I'm doing, do partial updates without a lock" option, but the fact that we'd be the first orm to do this makes me question the sanity of even considering that option :).

Uniqueness of timestamps is not a safe assumption. It will find a way to bite you, eventually.

> I think an ever better and safer option is to add optimistic locking > based on updated_at.

Uniqueness of timestamps is not a safe assumption. It will find a way to bite you, eventually.

Particularly fun when you get the code working smoothly on the dev box, only to fail randomly on more expensive and faster deployment server.

Version numbers work better, handled by optimistic locking, and a cheap way to calculate ETag on your models.

Assaf

This is certainly a risk, but it's much less likely than it seems at first. The particular values of the timestamp column are just opaque keys used for checking no one else has updated it. I certainly agree that a version number is the way to go, but if we wanted to push partial updates out more widely, this would be a low risk way of doing it.

In distant, but painful memory, it turned out to be *more* likely than it seemed at first. Likely enough, in that particular scenario, that it actually happened once in a couple of weeks. And by virtue of being unreproducible, this bug caused a lot of hurt.

Given that the entire problem that optimistic locking is trying to solve is multiple people fiddling with something at the same time you'd think the odds of a 'collision' would be relatively likely when there were indeed multiple people touching the row in the database at close together times.

Fred

Nate Wiger wrote:

2) Optimized updates

Currently, if one attribute is changed in an AR model, then the entire model is parsed and sent as an UPDATE statement. This is not really desirable from a network or database standpoint.

There are already some counters in AR for monitoring changes. Is there a per-attribute @changed data structure? It seems like this could just be set to true when updating a column. Then, on save, only those with the @changed flag are actually converted to SQL.

It isn't quite the solution to the problem, just a hack around it, but it solves the problem of saving just the pieces of a model I need. See:

http://lists.rubyonrails.org/pipermail/rails/2006-May/039499.html

But that didn't work for me exactly. I had to include some of the model associations and some of the columns that require validations, like "enforces_length_of :title" for instance. Adding that to the :select => and doing an update_attribute instead of a object.save successfully saved just want I wanted to without saving the entire model object.

position = params[:position] if section = @stuff.sections.find(params[:id], :select => 'sections.id, sections.bogus_id, sections.title, sections.position')   section.update_attribute(:position, position) if section.position != position end

This is what I ended up with. All of these objects are owned and edited by individual users of the system so I think its safe to assume sure there won't be any concurrency issues.

Brian Mcquay wrote:

It isn't quite the solution to the problem, just a hack around it, but it solves the problem of saving just the pieces of a model I need. See: ...

I posted my previous comment prematurely. Upon further testing it isn't working properly. Please disregard my previous comments.

>> This is certainly a risk, but it's much less likely than it seems at >> first.

> In distant, but painful memory, it turned out to be *more* likely than > it seemed at first. Likely enough, in that particular scenario, that > it actually happened once in a couple of weeks. And by virtue of being > unreproducible, this bug caused a lot of hurt.

Given that the entire problem that optimistic locking is trying to
solve is multiple people fiddling with something at the same time
you'd think the odds of a 'collision' would be relatively likely when
there were indeed multiple people touching the row in the database at
close together times.

Fred

I've never heard it put so well. I'm going to steal this if you don't mind. :wink:

@DHH:

Worth noting is that the dirty tracking that facilitates this relies on no changes in place. So post.title << "more" won't trigger a dirty mark and thus won't include post.title in an update. Which is another reason to make sure that existing apps don't get the partial updates before they're ready for it.

That's pretty drastic (in a good way :wink: ). Ruby would be a nicer place (sometimes) with the option of cheap, immutable Strings of course. I'm in the same boat... there's several options, and none of them nice. Original-value comparison (memory requirements go up, slows down having to #dup everything), immutable conventions (not "safe"), writing your own immutable string (what a rabbit-hole!)?

If you can manage to persuade the community on ignoring the mutable methods by convention, I'll be happy to follow your lead... :smiley: