[Feature Proposal] Allow hashes to be used as common table expressions

I am, unfortunately, not familiar enough with arel to even make a go at this feature.

I have a lot of code that uses common table expressions, but with lists of values. Right now, making that work requires putting the values into a set of select statements that are unioned together, or the use of Postgres array functions.

Let’s say I have some data that I want to use in a common table expression, in this case a list of employee ids and scores:

employee_id,score
1,100
2,95
3,80

I already have an employees table:

# create table employees (id serial, name varchar);
CREATE TABLE
# insert into employees values (1,'Jim'), (2,'Carol'), (3,'Bob');
INSERT 0 3
# select * from employees;
 id | name
----+-------
  1 | Jim
  2 | Carol
  3 | Bob
(3 rows)

If I have the ids and scores in a hash, I can add them to a common table expression using Postgres array functionality:

# select unnest(array[1,2,3]) as employee_id, unnest(array[100,95,80]) as score;
 employee_id | score
-------------+-------
           1 |   100
           2 |    95
           3 |    80
(3 rows)

That allows me to add these to a common table expression and reference them as a table:

# with employee_scores as (select unnest(array[1,2,3]) as employee_id, unnest(array[100,95,80]) as score) select id, name, score from employees inner join employee_scores on employee_id=id;
 id | name  | score
----+-------+-------
  1 | Jim   |   100
  2 | Carol |    95
  3 | Bob   |    80
(3 rows)

In Rails, it would look something like this:

Employee.with(“employee_scores” => “select unnest(array[1,2,3]) as employee_id, unnest(array[100,95,80]) as score”).joins(“inner join employee_scores on employee_id=id”).select(“id”, “name”, “score”)

The array functionality is unique to Postgres, so a more generalized method of doing the same thing:

# select 1 as id, 100 as score
# union
# select 2 as id, 95 as score
# union
# select 3 as id, 80 as score;
 id | score
----+-------
  1 |   100
  3 |    80
  2 |    95
(3 rows)

And that can be used in a common table expression as well.

But there’s another generalized way of handling this:

# with employee_scores (employee_id, score) as (values (1,100), (2,95), (3,80))
select id, name, score from employees inner join employee_scores on employee_id=id;
 id | name  | score
----+-------+-------
  1 | Jim   |   100
  2 | Carol |    95
  3 | Bob   |    80
(3 rows)

The issue here is that the column names must appear between the CTE “table” name and the word “AS”.

I would like to propose that we extend Arel such that a CTE that is an array of hashes, each with a consistent key set, gets turned into that particular format. In this case, it would look like this:

employee_scores = [
  { employee_id: 1, score: 100 },
  { employee_id: 2, score: 95 },
  { employee_id: 3, score: 80 }
]

Employee.with(“employee_scores” => employee_scores).joins(“inner join employee_scores on employee_id=id”).select(“id”, “name”, “score”)

This would create this SQL:

with employee_scores (employee_id, score) as (values (1,100), (2,95), (3,80)) select id, name, score from employees inner join employee_scores on employee_id=id;

This is a general, cross-platform way of handling the CTE elegantly, and it doesn’t clash with the current method of allowing an Arel expression in the “with” clause.

Thoughts?

Sorry, how is this different from a has_one :score association?

There is no association. There are various reasons for using common table expressions, and one is to be able to inject data from the application into a query as if it were in a table. I hope my examples make that clear.

Ok, now I read this more carefully. I think I understand what you mean. I wonder though wouldn’t in most cases be more efficient and readable to just obtain the employee objects and then you have them in your app, no need to push data to the database. With the cte, you are effectively pushing some data to the database that you later query on.

But since that data is already in your app, you can see which employees you need to obtain from DB and then do with them whatever you need.

It is currently hard for me to think of a use case where the CTE would be more efficient or more readable. Also if you work with huge amount of data, it might even be impossible.

So I’m not against this feature. Just trying to understand what are the use cases which would be more efficient this way.

The primary use case is when you have application data that you want to use for more complicated queries, and the data doesn’t exist in a database. CTEs allow you to basically inject the data into the database and use it as a table, but without the need to build a temporary table or any of that.

For an example of a really good use case, imagine you have a set of vectors - each with an associated score - and you’re looking for the nearest vectors in the database and want to order by distance multiplied by the score to obtain the best matches. The query is most easily written using a CTE. But the CTE itself just contains the data, so there’s no reason to have anything other than a set of values.

The primary issue keeping us from being able to do this right now is that the value list CTE requires a list of field names, and the “.with” method has no way of specifying such.

I can see that if you want to use the additional data in some complicated query which not only filters based on application data but both - application and database data.

What you describe though, in you can filter your application data inside your application and then obtain the database records you need with a simple query based on the IDs from the application data.

Also, in case this application data is really big, you may have troubles with such a large query. Because you basically send all this application data to the database within a single query. It has to parse it and then filter based on it, and finally you receive that same data back. Hard to imagine that it would be more efficient than doing within your application.

And if the data is not so much, then you can also hardly gain any efficiency going through the database.

My question is, why don’t you filter the data that is already within the app? And then just obtain your employees Employee.where(id: my_data_filter(my_app_data))

P.S. Although I don’t see it, I expect that there will be some real-world use case where it would make sense. My hope was to understand what that would be.

I don’t want to pollute the list with too much, so this is my last one. I’ve been programming for 45 years now. If there were some easier way to do it - I’d be doing that.

If I have a set of vectors and I want to find a set of close vectors in a database table that calculation has to be done in the database. I’m either sending a few K to the database or a few tens of megabytes back to my application from the database. These vectors are 1500 floats each and there are tens of thousands of them in the database. These calculations must take place there. And, yes, the queries are huge in this case, but that’s not an issue.

This is just one example. The bottom line is that it’s easier to just treat some data as a table for purposes of querying, and that’s one use-case for CTEs. I can get around it now by using pg array functions (suboptimal) or union’d selects (less suboptimal, but, suboptimal).

This seems like a fairly easy way of handling it and it doesn’t conflict with the current “with” method.

1 Like

I said I’d quit, but one other aspect of this to point out is that the CTE as a value list is another valid syntax for a CTE. Whether you can imagine a use case for it or not doesn’t strike me as relevant, although I can understand why a person may choose to devote time to an issue such as this based on its relevance to them personally. But that doesn’t generally speak to the relevance of the issue. It’s in the SQL standard so it’s fair game for inclusion in a SQL generation library.

It’s not the point to quit or argue. The point is to give a real world example so somebody finds interest in it. I don’t think the original example is very relevant although it may explain how the feature technically may work.

What you wrote about calculating vectors sounds more relevant to me. I think if you provide some example it may attract more interest but it is fine with me whatever you choose to do. I don’t really have a horse in this race.