Handling Hot Data with ActiveRecord

Hello!

We are thinking of different ways to handle hot data in our system. We get a lot of information that is highly relevant only for a limited period of time. For example, we have some data that we query on the basis of:
> counts over last 30 minutes
> data collected in the last 30 minutes
> items collected today
> items collected within the past week
...and so on.

We are looking at different strategies to keep this data updated and to manage scaling the database. One option is horizontal sharding [splitting data into multiple tables that inherit from a master table and carry CHECK constraints, so the planner can use constraint exclusion], and we are also considering having tables that hold data as:
> All the data
> Data for the last 24 hours
> Data for the last 2 hours
That way, when we want something recent, we would just query the most recent table, but on the few occasions that we need something more, we go to the larger, sharded tables (a rough sketch of the sharded layout follows).
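
For the sharding part, the inheritance-based layout we are considering looks roughly like this (names mirror our data_store_v2 example and the dates are purely illustrative; a sketch, not a finished design):

-- master table; child tables inherit its columns
CREATE TABLE data_store_v2 (
    id          bigserial PRIMARY KEY,
    group_id    integer   NOT NULL,
    recorded_on timestamp NOT NULL
);

-- one child per time range, each with a static CHECK constraint so the planner
-- can skip children that cannot match the query's recorded_on filter
-- (constraint exclusion, which is on by default for inheritance queries)
CREATE TABLE data_store_v2_2015_06 (
    CHECK (recorded_on >= DATE '2015-06-01' AND recorded_on < DATE '2015-07-01')
) INHERITS (data_store_v2);

CREATE INDEX ON data_store_v2_2015_06 (recorded_on);

-- queries go through the parent and only touch the children that can match
select count(*) from data_store_v2
where recorded_on >= (now() - '24 hr'::INTERVAL);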

Just for reference, we are doing something like this:
select count(*), count(distinct group_id) from
(select group_id, recorded_on from data_store_v2 order by id DESC limit 900000) td
where recorded_on >= (now() - '24 hr'::INTERVAL);

We are getting 800,000 data items a day right now, and the above query takes around 14 seconds on a single table that is not sharded and has around 268 million records. The table gets slightly slower every week, since we add close to 6 million records a week.

I've read this: https://www.amberbit.com/blog/2014/2/4/postgresql-awesomeness-for-rails-developers/ and am looking for ways to manage all of this within Rails, if possible.

So, the questions (and thanks for reading this far) are:
> What is a good way to do this while still working within ActiveRecord?
> This is a capability that we'd like to attach to any model that might need it. What would be a good way to approach building a gem for it?

I'm sure there will be a few more questions as we progress on this.

Thanks,
Mohit.

First I would work out which time increments occur most often in your query patterns, and plan from there. Then I would design the DB code and servers accordingly, aiming for optimal master-slave replication. I'm pretty sure that with replication, WHERE clauses like (where recorded_on >= (now() - '24 hr'::INTERVAL)) could be eliminated at the user level. But I am not certain.

Hi Elizabeth,

Thank you for replying.

First I would work out which time increments occur most often in your query patterns, and plan from there.

We do know that - we have a few peak hours when more data comes in, and then there are hours (like night time) when the data drops to a trickle for some of the records. However, there are some status emails, etc. that come in at a fixed frequency - a few times an hour, irrespective of the time of day.

Then I would design the DB code and servers accordingly, aiming for optimal master-slave replication. I'm pretty sure that with replication, WHERE clauses like (where recorded_on >= (now() - '24 hr'::INTERVAL)) could be eliminated at the user level. But I am not certain.

While this is something that might help, I'm looking more at solutions that involve hot-data tables along these lines:
  > All the data
  > Data for the last 24 hours
  > Data for the last 2 hours
That way, when we want something recent, we would just query the most recent table, but on the few occasions that we need something more, we go to the larger, sharded tables.

Best Regards,
Mohit.

Some suggestions in a slightly different direction:

1) rewrite without that completely unnecessary subquery, and see if the query time improves :wink:

2) check that appropriate indexes exist and are being used by the optimizer

3) really analyze the query execution plan, and look for more advanced opportunities; for instance, order by recorded_on instead of id, since, presumably, the query will already access the rows by an index on recorded_on; consider dropping the limit & order altogether; take that query to a PostgreSQL list... (a starting-point EXPLAIN is sketched after this list)

4) put some effort into learning SQL better; OK, we all make mistakes sometimes and maybe this is just that; but it sure looks to me like someone who doesn't really understand SQL struggling to throw together various clauses until the correct answer pops out (note that in addition to the issue I pointed out in 1, the subquery is selecting a column which is completely unused--probably doesn't affect anything, but just another sign that the person writing the query did not understand it).
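
For 2) and 3), the starting point is just PostgreSQL's built-in EXPLAIN, run against the real query; something along these lines:

-- ANALYZE actually executes the query and reports real timings;
-- BUFFERS shows how much data had to be read, which makes it obvious
-- whether the recorded_on index is really being used
explain (analyze, buffers)
select count(*), count(distinct group_id)
from data_store_v2
where recorded_on >= (now() - '24 hr'::INTERVAL);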

Amen! Amen! To what Scott has written. I actually wanted to offer an alternative SQL procedure, but thought that I shouldn't, as it would be out of context for RoR...

Hi Scott,

Thanks for your email. Your inputs are certainly useful.

select count(*), count(distinct group_id) from
(select group_id, recorded_on from data_store_v2 order by id DESC limit 900000) td
where recorded_on >= (now() - '24 hr'::INTERVAL);

Some suggestions in a slightly different direction:

1) rewrite without that completely unnecessary subquery, and see if the query time improves :wink:

We have tried this and the query is quite a bit slower. Filtering to the last 900k records before doing the recorded_on part helped speed it up.

2) check that appropriate indexes exist and are being used by the optimizer

3) really analyze the query execution plan, and look for more advanced opportunities; for instance, order by recorded_on instead of id, since, presumably, the query will already access the rows by an index on recorded_on; consider dropping the limit & order altogether; take that query to a PostgreSQL list...

Your email got me going back to look at all the parts again since obviously the query should be using the index and it was still slow. Further search last night made me realize that it's not the indexes that are a problem. The problem is the count(distinct group_id) part which seems to be quite slow in PostgreSQL. This is a lot faster:
select count(*) from
(select distinct group_id from
data_store_v2 where recorded_on >= '2015-06-06') td;
than:
select count(distinct group_id) from
data_store_v2 where recorded_on >= '2015-06-06';

as explained here: https://www.periscope.io/blog/use-subqueries-to-count-distinct-50x-faster.html

So, I guess the real problem was being masked by something else and an incorrect assumption on my part :slight_smile:

4) put some effort into learning SQL better; OK, we all make mistakes sometimes and maybe this is just that; but it sure looks to me like someone who doesn't really understand SQL struggling to throw together various clauses until the correct answer pops out (note that in addition to the issue I pointed out in 1, the subquery is selecting a column which is completely unused--probably doesn't affect anything, but just another sign that the person writing the query did not understand it).

Thanks for the analysis :smiley:
I do understand SQL, and I thought I wasn't throwing things together... for my understanding, which column was unnecessary? I thought we needed all of them:
> group_id for counting the distinct group_id
> recorded_on for the subsequent query on it
> id only for getting the most recent records

Most of my personal work is with SQLite3 on embedded platforms, but this discussion resolves the problem for now. It now moves away from being a Rails issue to being a PostgreSQL issue.

Best Regards,
Mohit.

We have tried this and the query is quite a bit slower. Filtering to the last 900k records before doing the recorded_on part helped speed it up.

I don't understand how that could possibly be the case if there's an index on recorded_on.

Your email got me going back to look at all the parts again since obviously the query should be using the index and it was still slow. Further search last night made me realize that it's not the indexes that are a problem. The problem is the count(distinct group_id) part which seems to be quite slow in PostgreSQL. This is a lot faster:
select count(*) from
(select distinct group_id from
data_store_v2 where recorded_on >= '2015-06-06') td;
than:
select count(distinct group_id) from
data_store_v2 where recorded_on >= '2015-06-06';

as explained here: https://www.periscope.io/blog/use-subqueries-to-count-distinct-50x-faster.html

So, I guess the real problem was being masked by something else and an incorrect assumption on my part :slight_smile:

I would expect the select count(distinct...) to be a major contributor to the time taken by the query, just given the amount of work it must do. The select count(*) from (select distinct...) alternative is a nice tip :slight_smile:

Thanks for the analysis :smiley:
I do understand SQL, and I thought I wasn't throwing things together... for my understanding, which column was unnecessary? I thought we needed all of them:
> group_id for counting the distinct group_id
> recorded_on for the subsequent query on it
> id only for getting the most recent records

I apologize--I misread the query structure. I got it into my head as:

select ... from (select ... from ... where recorded_on ... order by ... limit ...)

I think you can see how THAT query would have better fit my description of being poorly constructed.

I'm glad that my somewhat off-base pontification still managed to point you in a useful direction!

Hi Scott,

We have tried this and the query is quite a bit slower. Filtering to the last 900k records before doing the recorded_on part helped speed it up.

I don't understand how that could possibly be the case if there's an index on recorded_on.

Because it was the count(distinct x) that was the problem :slight_smile:
Doing only a count(*) is faster without the subquery... and is what we have switched to.

Your email got me going back to look at all the parts again since obviously the query should be using the index and it was still slow. Further search last night made me realize that it's not the indexes that are a problem. The problem is the count(distinct group_id) part which seems to be quite slow in PostgreSQL. This is a lot faster:
select count(*) from
(select distinct group_id from
data_store_v2 where recorded_on >= '2015-06-06') td;
than:
select count(distinct group_id) from
data_store_v2 where recorded_on >= '2015-06-06';

as explained here: https://www.periscope.io/blog/use-subqueries-to-count-distinct-50x-faster.html

So, I guess the real problem was being masked by something else and an incorrect assumption on my part :slight_smile:

I would expect the select count(distinct...) to be a major contributor to the time taken by the query, just given the amount of work it must do. The select count(*) from (select distinct...) alternative is a nice tip :slight_smile:

...and that is what the cause was.

Thanks for the analysis :smiley:
I do understand SQL, and I thought I wasn't throwing things together... for my understanding, which column was unnecessary? I thought we needed all of them:

group_id for counting the distinct group_id
recorded_on for the subsequent query on it
id only for getting the most recent records

I apologize--I misread the query structure. I got it into my head as:

select ... from (select ... from ... where recorded_on ... order by ... limit ...)

I think you can see how THAT query would have better fit my description of being poorly constructed.

I'm glad that my somewhat off-base pontification still managed to point you in a useful direction!

Yes, thanks again.

Best Regards,
Mohit.

So does the limit reduce it to less than 1 day's rows?

Hi Scott,

Because it was the count(distinct x) that was the problem :slight_smile:
Doing only a count(*) is faster without the subquery... and is what we have switched to.

So does the limit reduce it to less than 1 day's rows?

Yes, the idea of the 900k-record limit was to restrict the set to roughly one day's records and then run the query on that smaller set. However, after your comments, when I was looking around, I realized that the slowness is because of the count(distinct x) and not because of the indexes, etc.

For reference, on a 230 million record table, the numbers were roughly along these lines:
* SELECT count(*), count (distinct group_id) with a limit of 900k records based on ID DESC, followed by the recorded_on part = 8.6 seconds
* SELECT count(*), count (distinct group_id) on the whole table using only recorded_on in the WHERE = 14 seconds
* SELECT count(*) only with a limit of 900k records based on ID DESC, followed by the recorded_on part = 700ms
* SELECT count(*) on the whole table using only recorded_on in the WHERE = 350ms
--> Clearly, the culprit was the count (distinct group_id) - that benefits a lot by using a subquery to limit the number of records it considers
* SELECT count(*) from (select distinct group_id from data_store_v2 where recorded_on >= '') --> takes around 900ms

So, we are combining these in the final query now... it takes around 900ms to get both values (count and count distinct):
-- get the two values from 2 different subqueries
select * from
-- the first subquery gets the count(*)
(select count(*) AS all_count from data_store_v2 where recorded_on >= (now() AT TIME ZONE 'Asia/Singapore' - '24 hr'::INTERVAL)) as t1,
-- and joining in the second one for the count(distinct group_id)
(select count(*) AS distinct_count from
     -- this is yet another subquery
     (select distinct group_id from data_store_v2 where recorded_on >= (now() AT TIME ZONE 'Asia/Singapore' - '24 hr'::INTERVAL)) td ) as t2;

But as I mentioned, this is now a PostgreSQL question, not an ActiveRecord or Rails question :slight_smile:

Thanks for digging with me!

Best Regards,
Mohit.

That's kind of what I'd expect. That adding the count(distinct...) makes them slower is no surprise, but I was surprised that adding it inverts the relative performance of the two.

So, anyway, it sounds like you might have gotten it fast enough. If not, bounce the discussion over to pg's general mailing list, and we can talk about how to maintain a summary table without going through the hassle of full-on sharding :wink:
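
To sketch the summary-table idea (everything below is illustrative; the names are not from your schema, and the refresh would be driven by whatever scheduler you already have, e.g. cron or a Rails job):

-- small rollup table: one row per hour bucket
CREATE TABLE data_store_v2_hourly (
    bucket    timestamp PRIMARY KEY,   -- date_trunc('hour', recorded_on)
    row_count bigint    NOT NULL
);

-- refresh the current hour's bucket every few minutes
BEGIN;
DELETE FROM data_store_v2_hourly WHERE bucket = date_trunc('hour', now());
INSERT INTO data_store_v2_hourly (bucket, row_count)
SELECT date_trunc('hour', recorded_on), count(*)
FROM data_store_v2
WHERE recorded_on >= date_trunc('hour', now())
GROUP BY 1;
COMMIT;

-- "rows in the last 24 hours" then scans ~24 tiny rows (at hourly granularity)
SELECT sum(row_count) FROM data_store_v2_hourly
WHERE bucket >= date_trunc('hour', now() - '24 hr'::INTERVAL);

One caveat: distinct counts don't add up across buckets, so count(distinct group_id) would still have to go against the base table (or you'd have to store the per-bucket group_ids somewhere).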