It all began when a collegue said that “our free text search throws if the user enters something with a dot”. A bit of investigation showed that ActiveRecord produced a different set of queries depending on the user input containing a dot or not. The :include is handled with a single query with a join, or with two separate queries.
This is not a big problem in most cases; it only has a performance impact. In our case there was a failure because the :include joins a table which another scope joins for a different reason. We would never have noticed this otherwise.
I dug a bit, and found that the reason is that method ActiveRecord::Base.tables_in_string treats anything before a dot like a table name. So a condition like
[“x = ?”, params[:input]]
can produce a query “x = ‘foo.bar’”, and tables_in_string reports that we reference a table named “foo”.
It’s not a big problem for our application; the simple workaround for me was to filter out dots in user input. But I think it’s not good that AR can behave differently according to user input. So I filed a ticket and prepared a patch.
https://rails.lighthouseapp.com/projects/8994/tickets/3446-tables_in_string-matches-literal-string-containing-dot#ticket-3446-2
Cheers,
Matteo
Attempting to parse SQL using regular expressions may work for simple
cases, but not more complex ones. The following will still fail with
your patch:
irb(main):001:0> puts "name = 'foo\\\\' AND 'foo.bar ' = number"
name = 'foo\\' AND 'foo.bar ' = number
irb(main):002:0> puts "name = 'foo\\\\' AND 'foo.bar ' =
number".gsub(/'(\\'|[^'])*'|"(\\"|[^"])*"/, '')
name = foo.bar ' = number
irb(main):003:0> "name = 'foo\\\\' AND 'foo.bar ' =
number".gsub(/'(\\'|[^'])*'|"(\\"|[^"])*"/,
'').scan(/([a-zA-Z_][\.\w]+).?\./).flatten
=> ["foo"]
To be fair, ActiveRecord attempts to parse SQL using regular
expressions all over the place, and your patch handles the most common
current failure cases, so it's still probably a good candidate for
inclusion.
Jeremy
You are right. I improved the regexp and updated the ticket with a new
patch. Now it will treat any character after a backslash as part of
the string literal: /'(\\.|[^'])*'|"(\\.|[^"])*"/
I agree that parsing SQL correctly with regexps is hopeless. But the
string literal syntax should be well within the possibilities of a
finite-state automa. I believe this should work... I'll be happy to
be refuted
I have a few question, being an absolute newbie to this list... was it
bad manners on my part to assign the ticket to someone? I did choose
Pratik because he handled a very similar ticket some time ago.
Also, is it OK to add unit tests for a private method?
Matteo
You are right. I improved the regexp and updated the ticket with a new
patch. Now it will treat any character after a backslash as part of
the string literal: /'(\\.|[^'])*'|"(\\.|[^"])*"/
It's still going to be broken on PostgreSQL if
standard_conforming_strings is on, and potentially other databases too
if they conform to the SQL standard (where backslashes are not
special):
a = 'foo\' AND 'foo.bar ' = a
In standard SQL, 'foo\' and 'foo.bar' are two separate string
literals. Your remove_literal_strings converts it to:
a = foo.bar ' = a
I agree that parsing SQL correctly with regexps is hopeless. But the
string literal syntax should be well within the possibilities of a
finite-state automa. I believe this should work... I'll be happy to
be refuted
It'll work on some databases and not others, because of the
differences in SQL parsers. On PostgreSQL, whether it works depends
on the standard_conforming_strings setting. You could make it adapter
dependent, I suppose.
Also, in the SQL standard double quotes specify identifiers, not
strings, so remove_literal_strings is a misnomer, since it also
removes quoted identifiers. Whether backslashes are allowed and how
they are handled inside quoted identifiers is another thing that is
going to vary depending on the database used.
Maybe since ActiveRecord is getting a major bump to 3.0, it's time to
break backwards compatibility and use a different .find hash key for
preloaded includes (which use separate queries) vs. regular includes
(which use joins). That would avoid the issue of scanning strings
looking for whether or not tables are referenced in order to decide
whether to preload them.
Jeremy