How to deal with big tables in DB


I’m working on a project where we have a simple relation let’s say → :user has many through :photos. Photo could be in some category.

User can do different things with photos (for example “like”, “repost”, etc.)

On this project we would need to do some analytics around put users in different cohorts based on their activity with photos.

So, back to original question. We currently have 50M users and 200M photos. And we plan to import data about likes/reposts/etc from another source to our DB (currently Postgres).

Question is it the right option for us, I mean Postgres? Because for example “count” operation is taking ~20-30 seconds even now. We expect to have maybe a billions records in this “join” table.

What would you recommend to use? We need 1) store data 2) search and analyze

I’m just curious if your “count” if your doing a “select count(*)” every time a like/dislike is added? I’d assume if you added a cache counter column with the counts allocated for each photo it should be quite a bit faster.

Oh, it’s not so simple. It won’t help here

I’ve a big different schema. But queries are similar to what I told before. More interesting is amount of data. I’m just not sure how good is idea to have a join table with 1 billion records there.

It would be helpful to know (to answer the question):

  1. a rough schema of the relevant tables
  2. examples of the queries
  3. how realtime and exact queries results should be
  4. how often it is expected to run these queries

50M, 200M, 1B records - are not way too large tables and proper indexing and configuration may solve the problem, depending on the answers to the questions above. Some people would suggest to consider analytical databases, like ClickHouse.

You can also try having appropriate database indexes in place for quick response from DB. Which colums you index would depend on which columns you commonly use in your Active Record queries/joins. Also, use pagination wherever possible.

This sounds like, you need that big join for analytics?

With postgresql, a simple join is really fast. You might want to help by creating some indexes.

But when making a big big join with a lot of tables involved, then even the fastest machine might need some time to compute that.

We do this with materialized views. We update them via cron.

We first build an sql view to declare the rules, and then save the result to a materialized view. (and refresh the material view via cron as frequenly as needed)

Both, wither the live view or the materialized view are ready made acessible in Rails.

The materialized views are readonly, You better know, when updating via views.