references_eager_loaded_tables? matches string literals containing dots

It all began when a collegue said that “our free text search throws if the user enters something with a dot”. A bit of investigation showed that ActiveRecord produced a different set of queries depending on the user input containing a dot or not. The :include is handled with a single query with a join, or with two separate queries.

This is not a big problem in most cases; it only has a performance impact. In our case there was a failure because the :include joins a table which another scope joins for a different reason. We would never have noticed this otherwise.

I dug a bit, and found that the reason is that method ActiveRecord::Base.tables_in_string treats anything before a dot like a table name. So a condition like

[“x = ?”, params[:input]]

can produce a query “x = ‘foo.bar’”, and tables_in_string reports that we reference a table named “foo”.

It’s not a big problem for our application; the simple workaround for me was to filter out dots in user input. But I think it’s not good that AR can behave differently according to user input. So I filed a ticket and prepared a patch.

https://rails.lighthouseapp.com/projects/8994/tickets/3446-tables_in_string-matches-literal-string-containing-dot#ticket-3446-2

Cheers,

Matteo

Attempting to parse SQL using regular expressions may work for simple cases, but not more complex ones. The following will still fail with your patch:

irb(main):001:0> puts "name = 'foo\\\\' AND 'foo.bar ' = number" name = 'foo\\' AND 'foo.bar ' = number irb(main):002:0> puts "name = 'foo\\\\' AND 'foo.bar ' = number".gsub(/'(\\'|[^'])*'|"(\\"|[^"])*"/, '') name = foo.bar ' = number irb(main):003:0> "name = 'foo\\\\' AND 'foo.bar ' = number".gsub(/'(\\'|[^'])*'|"(\\"|[^"])*"/, '').scan(/([a-zA-Z_][\.\w]+).?\./).flatten => ["foo"]

To be fair, ActiveRecord attempts to parse SQL using regular expressions all over the place, and your patch handles the most common current failure cases, so it's still probably a good candidate for inclusion.

Jeremy

You are right. I improved the regexp and updated the ticket with a new patch. Now it will treat any character after a backslash as part of the string literal: /'(\\.|[^'])*'|"(\\.|[^"])*"/

I agree that parsing SQL correctly with regexps is hopeless. But the string literal syntax should be well within the possibilities of a finite-state automa. I believe this should work... I'll be happy to be refuted :slight_smile:

I have a few question, being an absolute newbie to this list... was it bad manners on my part to assign the ticket to someone? I did choose Pratik because he handled a very similar ticket some time ago.

Also, is it OK to add unit tests for a private method?

Matteo

You are right. I improved the regexp and updated the ticket with a new patch. Now it will treat any character after a backslash as part of the string literal: /'(\\.|[^'])*'|"(\\.|[^"])*"/

It's still going to be broken on PostgreSQL if standard_conforming_strings is on, and potentially other databases too if they conform to the SQL standard (where backslashes are not special):

a = 'foo\' AND 'foo.bar ' = a

In standard SQL, 'foo\' and 'foo.bar' are two separate string literals. Your remove_literal_strings converts it to:

a = foo.bar ' = a

I agree that parsing SQL correctly with regexps is hopeless. But the string literal syntax should be well within the possibilities of a finite-state automa. I believe this should work... I'll be happy to be refuted :slight_smile:

It'll work on some databases and not others, because of the differences in SQL parsers. On PostgreSQL, whether it works depends on the standard_conforming_strings setting. You could make it adapter dependent, I suppose.

Also, in the SQL standard double quotes specify identifiers, not strings, so remove_literal_strings is a misnomer, since it also removes quoted identifiers. Whether backslashes are allowed and how they are handled inside quoted identifiers is another thing that is going to vary depending on the database used.

Maybe since ActiveRecord is getting a major bump to 3.0, it's time to break backwards compatibility and use a different .find hash key for preloaded includes (which use separate queries) vs. regular includes (which use joins). That would avoid the issue of scanning strings looking for whether or not tables are referenced in order to decide whether to preload them.

Jeremy