Rails validation is inefficient

Hi,

I have a User model with a validates_uniqueness_of :login

The generated SQL for the validation is: SELECT * FROM `users` WHERE (LOWER(users.login) = 'fernando' AND users.id <> 10001) LIMIT 1; and it takes 0.13s to run (my table has 10,000 rows).

When I run EXPLAIN on it, it shows that only the primary key is used as an index, which is not efficient.

With an index on login and the following MySQL query: SELECT * FROM `users` WHERE (users.login = 'fernando' AND users.id <> 10001) LIMIT 1; it takes less than 0.001s, and EXPLAIN lists both the primary key and the login index.

How can I remove this inefficient LOWER() call from the MySQL query? I can do the lowercasing directly in Ruby.

Hi, How can I remove this inefficient LOWER() call from the MySQL query? I can do the lowercasing directly in Ruby.

Play with the :case_sensitive option

Fred

No, that doesn't work. I want case-insensitive validation.

But instead of having it plugged into the MySQL query with LOWER(), I want Ruby to do login.downcase and then pass the result to MySQL.

Looks like you indeed want case-sensitive SQL with normalized values. As Frederick suggests, :case_sensitive => true should do if that's the case. I guess logins are normalized in the database, because otherwise that login.downcase approach wouldn't make sense.

BTW http://dev.rubyonrails.org/changeset/9248 is relevant.

Xavier Noria wrote:

No, that doesn't work. I want case-insensitive validation.

There are several things you may wish to consider here.

1. Rails "validates_uniqueness_of" is a trap that will, over time, eventually permit duplicate values in your database. The only way to prevent this is to have an INDEX UNIQUE clause on the column that you wish to enforce the uniqueness of. This constraint properly lives in the DBMS and not in the application logic in any case, regardless of the expressed opinions of others.

The way that I handle this in my migrations is to do this inside the .up method:

  add_index :table, :column,
            :name => :idxU_table_column,
            :unique => true

2. Removing "validates_uniqueness_of" gets rid of the auto-generated LOWER in the SQL but that still leaves the problem of denormalized (mixed-case) entries colliding on the same normalized value (lowercase). The way that I dealt with this was to write a keycase method and inject it into string by placing the file containing this code into ./config/initializers:

  class String
    # remove extra whitespace, collapse internal runs of spaces,
    # and downshift letters
    def keycase
      strip.squeeze(" ").downcase
    end
  end

Now, in your model file just use this:

  # override assignment
  def column_attribute=(name)
    # keycase is a local extension of class String.
    write_attribute(:column_attribute, name.keycase)
  end
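With that in place, assignment normalizes the value before it is stored. A quick illustration, assuming the column is :login (so the override above would be def login=(name)):

  user = User.new
  user.login = "  FerNANdo  "
  user.login  # => "fernando", already normalized in the attribute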

You can use the titlecase method from ActiveSupport on the normalized data in the views if you desire prettier output than straight lowercase.

The advantage of this approach over simply doing everything in the Rails application is that you allow the DBMS planner to make the best use of the index, and you only store normalized data, greatly simplifying the API for non-Rails clients.

HTH Jim

Since you're using MySQL:

http://dev.mysql.com/doc/refman/5.0/en/comparison-operators.html#operator_equal

then scroll up a few lines to see:

"By default, string comparisons are not case sensitive and use the current character set. The default is latin1(cp1252 West European), which also works well for English."

So let ActiveRecord think that the comparison is case sensitive and needs no LOWER(), and MySQL will do what you want.
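In other words, keep the validation but turn off the LOWER() wrapping. A minimal sketch (model and column taken from the thread; the behavior relies on MySQL's default case-insensitive collation):

  class User < ActiveRecord::Base
    # :case_sensitive => true makes Rails skip LOWER(), generating
    #   SELECT * FROM `users` WHERE (users.login = 'fernando' ...) LIMIT 1
    # which can use the index on login. MySQL's default collation still
    # compares case-insensitively, so the check stays case-insensitive.
    validates_uniqueness_of :login, :case_sensitive => true
  end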

(And add a comment to your code and a test to prove it so you'll know if MySQL changes its behavior in the future.)
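Such a test might look like this (a sketch in Rails 2-era Test::Unit style; names are illustrative):

  class LoginUniquenessTest < ActiveSupport::TestCase
    def test_uniqueness_check_is_case_insensitive_on_mysql
      User.create!(:login => "fernando")
      user = User.new(:login => "FERNANDO")
      # If MySQL's collation ever becomes case sensitive here,
      # these assertions will fail and flag the change.
      assert !user.valid?
      assert user.errors.on(:login)
    end
  end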

-Rob

Rob Biedenharn http://agileconsultingllc.com Rob@AgileConsultingLLC.com

Is it a bug that Rails does not add BINARY?

Hey James, you actually pinpointed a huge failure in Rails that I hadn't figured out, concerning the duplicate values that validates_uniqueness_of won't detect. And yet the Rails bubble constantly brags about the benefits of TDD/BDD...

Thank you Rob for the link, it works as expected.

Hey James, you actually pinpointed a huge failure in Rails that I hadn't figured out, concerning the duplicate values that validates_uniqueness_of won't detect. And yet the Rails bubble constantly brags about the benefits of TDD/BDD...

What does TDD have to do with a validation??

Add an index with :unique => true to the migration that creates your table...

Oh, it is not that tests fail to identify that it is broken; it is known that it implies a race condition. You could argue that in that case it shouldn't exist in the first place, but that's a different issue.

Using the unique index technique, I have had to override the save method according to this wiki page: http://wiki.rubyonrails.org/rails/pages/HowtoHandleStatementInvalid

How can I detect that the error is a duplicate entry problem? I don't want to show any other error message to the user if the error is not a duplicate entry.
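One way is to rescue ActiveRecord::StatementInvalid and inspect the message. A rough sketch assuming MySQL's "Duplicate entry" wording (the method name, attribute, and error message are illustrative):

  def save_handling_duplicates
    save!
    true
  rescue ActiveRecord::StatementInvalid => e
    # MySQL reports unique index violations as
    # "Duplicate entry '...' for key '...'" (errno 1062, SQLSTATE 23000).
    raise unless e.message =~ /Duplicate entry/
    errors.add(:login, "has already been taken")
    false
  end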

You could argue that in that case it shouldn't exist in the first place, but that's a different issue.

Yeah, that's what I was thinking about. Okay, the race condition has a 0.0000001% chance of happening, but as long as this chance exists it is unacceptable, and therefore validates_uniqueness_of shouldn't exist in its current state, especially as we are actually rewriting it to handle this kind of problem.

We could probably even submit a patch if it hasn't been fixed in Rails 2.1 RC1 yet?

Problem is, the validation is done outside the database, so I don't think it is "fixable". Also, in the database you can't express :scope, for example... I believe there's a trade-off on the programmer's side here. The docs explain the race condition, and the validation may still have some valid use cases that accept the risk.
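For reference, :scope restricts the uniqueness check to rows sharing another attribute. A one-line sketch with a hypothetical account_id column:

  # login only has to be unique among users of the same account
  validates_uniqueness_of :login, :scope => :account_id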

The docs explain the race condition, and the validation may still have some valid use cases that accept the risk.

Which docs? I didn't find anything in the Rails API: http://api.rubyonrails.org/classes/ActiveRecord/Validations/ClassMethods.html#M001330

Anyway, using the unique index feature of MySQL introduces a little problem: the validation of this unique field doesn't show up alongside the other Rails validations. The unique index error only appears once all other Rails validations have passed, because the index is only hit when Rails actually tries to save or update the record.

Yeah, check the paragraph that starts with "Because this check is performed outside".

Perhaps that paragraph could also say something about the gotcha you mention.

Xavier Noria wrote:

Problem is, the validation is done outside the database, so I don't think it is "fixable". Also, in the database you can't express :scope, for example... I believe there's a trade-off on the programmer's side here. The docs explain the race condition, and the validation may still have some valid use cases that accept the risk.

The problem is that on the altar of "agnosticism" the ability to trap and report DBMS errors has been "sacrificed". Rails is a general-purpose web application development tool. To be as attractive as possible to the widest audience, its authors have elected, quite sensibly in my opinion, to provide 80% of the value for 20% of the effort. So SQL support is reduced to the very lowest common denominator (SQLite).

What it comes down to is that SQLSTATE, the SQL-92/99/03 standard for what a DBMS must report for certain error conditions, is not uniformly implemented. In SQLite3, for instance, a duplicate index value produces a nondescript generic error message similar to the following:

"SQL logic error or missing database"

This is really not much help outside the context of the INSERT call, so determining the exact nature of the error is perforce left to the application; in other words, to "validates_uniqueness_of".

I digress for a moment, but it seems to me that the proper way to handle this in Rails would have been to INSERT first, trap any error, and THEN do a find on the index to see if the error was caused by a duplicate. If a duplicate is found and differs from the row just inserted (based on a hash of the inserted row's contents less the id attribute versus a hash of the returned row less its id attribute?), then report a validation error; if not, then raise the original DBMS error. That way the race condition would never occur and there would, on average, be ((1 / <expected collision rate>) - 1) fewer SQL reads.
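A rough sketch of that insert-first idea, building on the duplicate-detection rescue shown earlier in the thread (the method name and the find_by_login confirmation step are assumptions, not anything Rails provides):

  def create_reporting_duplicates
    save!
    true
  rescue ActiveRecord::StatementInvalid => e
    raise unless e.message =~ /Duplicate entry/
    # The INSERT tripped the unique index; do the find AFTER the
    # failure to confirm a genuine collision, avoiding the race.
    existing = User.find_by_login(login)
    if existing && existing.id != id
      errors.add(:login, "has already been taken")
      false
    else
      raise
    end
  end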

Returning to the main stream of thought, PostgreSQL 8 returns this:

"ERROR: Cannot insert a duplicate key into unique index".

While MySQL 5 can return two different messages, depending on whether one is using NDB or not:

NDB: ERROR 1169 (23000): Can't write, because of unique constraint, to table <table>

not NDB: ERROR 1062 (23000): Duplicate entry <entry> for key <key>

Note: the common element in these latter two messages, (23000), simply tells us the SQLSTATE falls under "Integrity constraint violation".

The exact error code that should be returned in this case is (23505), which is exactly what PostgreSQL [sqlca.sqlstate], DB2 [sqlca.sqlstate], and MySQL [mysql_sqlstate()] do return; they just choose not to display it in the human-readable message. This sort of error trapping should really all live in the DB adapters, but that poses serious coding problems for Rails when accessing a DBMS that does not support the full SQLSTATE implementation; SQLite3, for instance, seemingly returns SQLSTATE=HY000 for everything that goes wrong.

My suggestion is that you seriously consider which DBMS best supports the production requirements for your application and use it for development and testing. Your design is in any case going to reflect the specific implementation details of the DBMS you select, so pick the DBMS first, and after you decide do not worry overmuch about the other possibilities.

When convenient, use SQLite3 to sketch out design ideas and to code proofs of concept, but do not break your heart trying to get a single-user DBMS to act like industrial-strength software. Similarly, while MySQL has many adherents who advocate its suitability for production use, I am not one of them.

Remember that in the vast majority of business applications, it is the data that is valuable and not the present method of its presentation. The choice as to where the data goes and how it is handled, verified and secured from corruption, tampering and loss, trumps the latest whiz-bang programming paradigm every day of the week. Remember too, that any significant persistent data store is eventually going to be accessed by more than one application and by more than one programming language in the course of its existence.

James Byrne wrote:

I digress for a moment, but it seems to me that the proper way to handle this in Rails would have been to INSERT first, trap any error, and THEN do a find on the index to see if the error was caused by a duplicate. If a duplicate is found and differs from the row just inserted (based on a hash of the inserted row's contents less the id attribute versus a hash of the returned row less its id attribute?), then report a validation error; if not, then raise the original DBMS error. That way the race condition would never occur and there would, on average, be ((1 / <expected collision rate>) - 1) fewer SQL reads.

Unstated above, but necessary and implied, is that uniqueness constraints are and must be enforced at the DBMS level. Rails simply checks for conformance.

I'm sort of working on a plugin to enforce it through the validation mechanism, in a way (haven't had much time to hack on it lately). I'll put it on GitHub when it's done (if it ends up working like I think it will).

--Jeremy

Thank you James for your very informative message.

Quoting Fernando Perez <rails-mailing-list@andreas-s.net>:

Hi,

I have a User model with a validates_uniqueness_of :login

The generated SQL for the validation is: SELECT * FROM `users` WHERE (LOWER(users.login) = 'fernando' AND users.id <> 10001) LIMIT 1; and it takes 0.13s to run (my table has 10,000 rows).

When I run EXPLAIN on it, it shows that only the primary key is used as an index, which is not efficient.

With an index on login and the following MySQL query: SELECT * FROM `users` WHERE (users.login = 'fernando' AND users.id <> 10001) LIMIT 1; it takes less than 0.001s, and EXPLAIN lists both the primary key and the login index.

How can I remove this inefficient LOWER() call from the MySQL query? I can do the lowercasing directly in Ruby.

Is your Web site so heavily loaded that it is more cost-efficient to spend expensive programmer time tweaking the code that creates a new and unique login name than to simply leave it alone or add more cheap hardware?

Note: validations can be restricted to just certain actions, e.g.

  validates_uniqueness_of :login, :on=>:create

This may be much more Railish/Rubyish, takes less programmer time, and is less likely to be broken by new Rails & Ruby versions.

Just my $0.02USD,   Jeffrey

Is your Web site so heavily loaded that it is more cost-efficient to
spend expensive programmer time tweaking the code that creates a new and
unique login name than to simply leave it alone or add more cheap hardware?

Note: validations can be restricted to just certain actions, e.g.

validates_uniqueness_of :login, :on=>:create

I'm sure Jeffrey knows this, but just wanted to make this explicit for
others (as there is sometimes confusion about this). The :on option
refers to whether an object is being created, updated or saved (the
last covers the two previous options and is the default). It has
nothing to do with the controller action. You can call your create
action bananarama and :on => :create validations will still fire (it
pretty much has to be this way - your models should not know or care
about stuff that happens way above them).

Fred