So many users, so little time - looking for a better way

OK, time to come clean. Right now my app is doing something really cheesy. The first step is always admitting that you have a problem, right?

So here's the deal. One of the features of my app is that, based on some user settings, it is supposed to send out "reminder" emails at certain times of the day. Each user has their own time settings for these reminders.

Right now I have a background task that, once every 15 mins, looks at every gosh darn user and decides whether or not to send them a reminder. Once every 15 mins I hit every single user row in the database. brrrrrrr. Yeah, I know, gives me the willies too.

So is there some sort of a "best practice" established for how to do some caching/intelligent-scheduling to overcome this horrendous mis-use of the database? Is there some sort of a plugin that would allow me to sort the database by that time-of-day column and intelligently find just those within a certain range of times?

I'm very nearly an SQL virgin, so if this is painfully simple and I'm just a moron, be gentle.

thanks, jp

You should be able to construct a query (or use find with :conditions) and let the database do the work. Post the relevant tables, fields, and classes and I'm sure you'll get many offers of help.

Peace.

So, gosh, maybe you're running your server on an ancient Intel 286 box under Craplix "Muddy Mammoth" -- 15 minutes? Even that box could execute, oh, eleventeen gazillion instructions in that time period.

Plus /yet another/ SQL query. Brutal. I'll bet your phone is ringing off the wall with customer complaints.

When your traffic gets twitteresque, please call back. Operators are standing by.

/* see also "premature optimization" */ :slight_smile:

Build yourself a carousel with buckets some time (15 min?) apart. Install each user's task in its bucket and rotate the carousel with the clock. As a bucket comes up, service any tasks it contains and move them to the appropriate future bucket.

This is fairly easy to do with a doubly linked list - I'm not too sure yet how it maps into the MVC world view.
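As a rough sketch of how the carousel might look in plain Ruby -- all the names here (ReminderCarousel, task.run, task.next_run_at) are invented for illustration, and a simple array of buckets stands in for the linked list:

  # Rough timing-wheel sketch; 96 buckets of 15 minutes cover a day.
  class ReminderCarousel
    SLOT_MINUTES = 15

    def initialize
      @bucket_count = (24 * 60) / SLOT_MINUTES
      @buckets = Array.new(@bucket_count) { [] }
    end

    # Map a time of day to its bucket index.
    def index_for(time)
      ((time.hour * 60 + time.min) / SLOT_MINUTES) % @bucket_count
    end

    def schedule(task, at_time)
      @buckets[index_for(at_time)] << task
    end

    # Called by the periodic background tick: empty the current bucket,
    # run its tasks, and rotate each one forward to its next slot.
    def service(now = Time.now)
      idx = index_for(now)
      due = @buckets[idx]
      @buckets[idx] = []
      due.each do |task|
        task.run
        schedule(task, task.next_run_at)
      end
    end
  end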

Hi Jeff,

Depending on how you have the data stored, yes, you should be able to do something in SQL to make the search better.

Does each user just have one time of day assigned to them? If so, then you can just put a condition on the find query.

Something like user.reminder_time > start_range and user.reminder_time < end_range and let the DB do the grunt work for you.

If the user notification is something like 'every 4 hours' then you may have to process each user individually to work out whether the current time fits that range. Or have a 'next_notification' column that you update each time you send a notification to that user, then do the db query on the next_notification field.
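Something along these lines, in Rails 2-era finder syntax. The reminder_time / next_notification columns and the ReminderMailer are assumed names for illustration, not anything from your app:

  # Fixed time-of-day case: let the DB pick out the rows in range.
  start_range = 15.minutes.ago
  end_range   = Time.now
  due = User.find(:all,
    :conditions => ["reminder_time > ? AND reminder_time <= ?",
                    start_range, end_range])

  # 'next_notification' case: query on the column, bump it after sending.
  User.find(:all, :conditions => ["next_notification <= ?", Time.now]).each do |user|
    ReminderMailer.deliver_reminder(user)                      # assumed mailer
    user.update_attribute(:next_notification, Time.now + 4.hours)
  end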

Cheers Simon

When your traffic gets twitteresque, please call back. Operators are standing by.

/* see also "premature optimization" */ :slight_smile:

There is no such thing as premature optimization, only perhaps mis-prioritized optimization (there may be more impactful things to address first), but if you wait until you "have to" optimize before doing it... it is too late. Especially if you consider that optimization may require rethinking table structure and/or other fundamental architecture. There is, however, such a thing as a premature, jerky, smart-ass response like the one above.

That said, a couple of questions first:

- How is your query structured now? find(:all, ???)?
- What restrictions do you have on the reminder period, if any? Can they choose any time of the day, any frequency? Or is it restricted to "every 15 minutes", "every hour", etc.?
- How many users do you have now? How many do you expect to need to scale to?
- How long are the queries taking now? Too long?

These are questions that you should answer/ask when making a decision on the need for, and approach to, optimization. For instance, restricting the notification times/intervals to a few fixed values lends itself to multiple improvements, at a caching/queuing level as well as at an SQL query level.

A few VERY general ideas on how to approach this are:

1) SQL optimization: use direct SQL connections and optimize the query as appropriate for your DB type, using constructs such as SORT, GROUP BY, creating indexes, etc.

2) Do the sort/search in "object space" using a partitioning approach.

3) Fake a crude "bucketing" system using additional look-up tables, or more specifically has_many relationships, to partition based on fixed intervals -- for example, an hours_of_day model that has_and_belongs_to_many :reminders, with one row for each of the 24 hours of the day. You could then relate each reminder to one of the rows. Your background process then asks for all records that are associated with the row representing the current time... convoluted, I know. The performance benefit may be questionable because of the JOINs required for a habtm look-up, but it should be less brute force than a find all. (See the sketch just after this list.)
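Sketching idea 3, with the caveat above -- HourBucket, its hour column, and the implied join table are all invented names:

  # 24 rows in hour_buckets, one per hour of the day.
  class HourBucket < ActiveRecord::Base
    has_and_belongs_to_many :reminders      # join table: hour_buckets_reminders
  end

  class Reminder < ActiveRecord::Base
    belongs_to :user
    has_and_belongs_to_many :hour_buckets
  end

  # Background task: ask only for the reminders hooked to the current hour.
  bucket = HourBucket.find_by_hour(Time.now.hour)   # assumes an integer hour column
  due = bucket ? bucket.reminders.find(:all, :include => :user) : []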

I have quite a few others ideas, but will leave it at this....

There is no such thing as premature optimization, only perhaps mis-prioritized optimization (there may be more impactful things to address first), but if you wait until you "have to" optimize before doing it... it is too late.

And the OP presented zero evidence of an actual problem, or hint of any potential problem. Running one SQL query every 15 minutes seems, uh, not too much to worry about. IMO.

There is, however, such a thing as a premature, jerky, smart-ass response like the one above.

Thanks for the evaluation. And your qualifications on constructing and running high-traffic web sites? Just curious.

Did you read the original post?

It's not just one SQL query; it's one SQL query and then hitting every user object returned to see if it is time to send the email. Anyone can see that eventually this is going to cause a problem, and I don't know about you, but I would rather find that sort of thing now, before my customers notice a problem, than wait until it actually causes one.

Oh, and you don't actually need a qualification to be able to spot a premature, jerky response, just some experience on the internet.

There are up to two per day for each user, but they can be handled separately. Yes, currently using the disgusting "find(:all)" approach.

No, with just 100 users so far there is no sign of a problem. I just consider the current method to be very inelegant and would rather do it in a more intelligent way. Even if, once we get to a few thousand users, it still isn't causing a "problem", it will be using significant resources, thus reducing the responsiveness of the app. That said, some up-front profiling of the current method might convince me that I'm wasting my time. I'll do that first. Point taken (even if made with more than a touch of venom).

So is it generally the case that having the db hunt down the few records we're after each time will execute substantially faster than returning all of the records and doing the comparison "outside the database"? As a novice db guy, the two approaches sound roughly similar, since both cases involve doing essentially the same comparison on the same number of records. Perhaps there is a lot of overhead involved in returning records to the app.

I would think that some method of caching the info such that it didn't have to actually go to the db every time would make the biggest difference in execution time. Unfortunately, such a solution would also be more complex than just using a better query.

thanks for all the responses.

best, jp

Bsytlz wrote:

Jeff Pritchard wrote:

So is it generally the case that having the db hunt down the few records we're after each time will execute substantially faster than returning all of the records and doing the comparison "outside the database"? As a novice db guy, the two approaches sound roughly similar, since both cases involve doing essentially the same comparison on the same number of records. Perhaps there is a lot of overhead involved in returning records to the app.

Yes, the database is going to be faster at figuring out which records need to be returned, ASSUMING proper indexing, table structure, and a generally sound approach to querying. If you don't use indexes, your data is not structured well, and you're querying on half the fields in the database, then it may well be faster to just pull it all back into Rails and work through it.

As to why this is so, two major things come to mind: marshaling the data from the db server to the application and object creation. Suppose you have your thousand users in your system, but only 20 of them have reminders to send. If the database finds them for you, it returns just twenty and your application creates just twenty objects (and you have no more decisions to make). If you pull all one thousand back first, the database has to send more data back to you and then you're going to create 1000 objects in your application. If you need related data for the decision making, you're going to be making more than 1000 objects, and possibly more database queries (unless you've used eager loading, which increases the amount of data sent back to the app).
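To put the two approaches side by side (reminder_due? and next_notification are hypothetical names):

  # App-side filtering: marshals and instantiates all 1000 users.
  due = User.find(:all).select { |u| u.reminder_due?(Time.now) }

  # DB-side filtering: only the ~20 matching rows ever leave the database.
  due = User.find(:all, :conditions => ["next_notification <= ?", Time.now])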

Whatever creates the data should be the fastest at ripping through it. The database engine is optimized for data storage and retrieval. When used properly, nothing is going to be faster than the db engine's native access to the data.

Peace.,

Jeff, A few pointers, since I am maintaining and enhancing a rather large application with similar functionality. It seems to work well (most of the time):

I am firing an SQL query every minute to check for reminders, instead of every 15 minutes, so the actual query is not an issue. As for its construction: you may want to create a separate table called "reminders" (or something similar) which stores reminder records for only those users who want them. That way, if the majority of your users do not use this functionality, you are not incurring unnecessary overhead. This table will have an id, as every Rails table does, and will have an index on id. It should also have the foreign keys, at least for the users (user_id); make sure that it is indexed as well, since you will be joining the reminders and users tables on this foreign key. You can have other foreign keys as well, for example event_id if the reminders are for particular events. That is how a one-to-many relationship works. You may want to refer to any of the online tutorials available on basic database design, which is a prerequisite to making good Rails applications.
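Something along these lines -- the table and column names are just examples, not prescriptions:

  # Illustrative migration for a separate reminders table.
  class CreateReminders < ActiveRecord::Migration
    def self.up
      create_table :reminders do |t|
        t.integer  :user_id,  :null => false
        t.integer  :event_id               # optional, if reminders are per-event
        t.datetime :remind_at
        t.timestamps
      end
      add_index :reminders, :user_id       # speeds up the join with users
      add_index :reminders, :remind_at     # speeds up the "what's due now" query
    end

    def self.down
      drop_table :reminders
    end
  end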

The scheduling of these recurring tasks is also important. You want to use a cron-style scheduler that does not load the Rails environment every time the task is run. If you use the unix/linux cron to schedule a Rails runner, it will load the Rails environment every time you call the task, which is a very slow and demanding process. My application uses Backgroundrb. I am looking for alternatives which are simpler, easier to debug, and more reliable.
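To illustrate the difference: a long-running daemon boots Rails once, while cron plus script/runner pays the boot cost on every invocation. A minimal sketch (Reminder.send_due! is a made-up method):

  #!/usr/bin/env ruby
  # Boot the Rails environment once, then loop.
  require File.dirname(__FILE__) + '/../config/environment'

  loop do
    Reminder.send_due!   # hypothetical: find and deliver whatever is due
    sleep 60             # the every-minute check described above
  end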

The point of all this is that you are not taking on a trivial task, but rather a substantial one, if you want to better architect your application; so you may want to think it through carefully. Having said that, you are thinking about the right things. Maybe you want to experiment with a prototype which is not in your critical path?

Hope this helps.

Regards,

Bharat

Jeff,

You could use the same approach as Ryan in his latest background job screencast: #129 Custom Daemon - RailsCasts.

Or take a look at one of the schedulers for the task (passing the specific user information along).

I spent some time researching background tasks, my notes might help you:

BACKGROUND PROCESSES
--------------------

http://wiki.rubyonrails.org/rails/pages/HowToRunBackgroundJobsInRails

== Asynchronous Tasks

Launched from the Rails request/response cycle + optional Message Queueing

- Options
  . ar_mailer (only for emails) => GOOD (only email)
  . starling + workling => EXCELLENT
  . Ap4r
  . Spawn / background => SIMPLE
  . ActiveMessaging
  . BackgroundFu
  . WorkerQueue
  . Rake in Background => SIMPLE (#127 Rake in Background - RailsCasts)

== Recurring Jobs (cron-style)

Scheduled or permanently running in the background

- Options
  . daemon_generator (rails_cron) => SIMPLE (#129 Custom Daemon - RailsCasts)
  . BackgrounDRb
  . rufus-scheduler => GOOD
  . Taskr
  . UNIX Cron and Runner (simple, but consumes resources and you can't control it well)

Cheers, Sazima

There are up to two per day for each user, but they can be handled separately. Yes, currently using the disgusting "find(:all)" approach.

Since you mentioned using a background task for this, I assumed a straight SQL query. My bad.

No, with just 100 users so far there is no sign of a problem. I just consider the current method to be very inelegant and would rather do it in a more intelligent way. Even if, once we get to a few thousand users, it still isn't causing a "problem", it will be using significant resources, thus reducing the responsiveness of the app.

My point, which I attempted -- apparently unsuccessfully! -- to make humorously, is that running a single SQL query against a properly indexed database, with even a row count in the millions, is trivial. Really. And if it's being done from a background app, it should have little or no effect on your Rails app's performance from the user's perspective anyway.

So is it generally the case that having the db hunt down the few records we're after each time will execute substantially faster than returning all of the records and doing the comparison "outside the database"?

Absolutely -- that's what an RDBMS is designed for.

I would think that some method of caching the info such that it didn't have to actually go to the db every time would make the biggest difference in execution time. Unfortunately, such a solution would also be more complex than just using a better query.

The execution time to retrieve the list and build the queue is trivial compared to the amount of time to actually process the mail queue, particularly as the user count grows.

Still, for the sake of argument -- build your "mailing list" in memory, and update it when a change event occurs in the app (user submits a changed profile, new user joins, etc.). Though you'll still incur at least one update/insert into the DB, which is more expensive than a select.
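The "update it when a change event occurs" part might look like this (ReminderSchedule is a hypothetical in-memory holder):

  class User < ActiveRecord::Base
    after_save    { |user| ReminderSchedule.refresh(user) }
    after_destroy { |user| ReminderSchedule.remove(user) }
  end

Though note that if the mailing runs in a separate background process, an in-process cache in the Rails app won't be visible there without some shared store.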

One last venom-free comment regarding premature optimization -- if you "fix" something today that /isn't/ broken, you may be making it harder to fix a /real/ bottleneck that crops up tomorrow.

Just sayin' ...

Best of luck,

Jeff, A few pointers, since I am maintaining and enhancing a rather large application with similar functionality. It seems to work well (most of the time):

I am firing an SQL query every minute to check for reminders, instead of every 15 minutes, so the actual query is not an issue. As for its construction: you may want to create a separate table called "reminders" (or something similar) which stores reminder records for only those users who want them. That way, if the majority of your users do not use this functionality, you are not incurring unnecessary overhead. This table will have an id, as every Rails table does, and will have an index on id. It should also have the foreign keys, at least for the users (user_id); make sure that it is indexed as well, since you will be joining the reminders and users tables on this foreign key. You can have other foreign keys as well, for example event_id if the reminders are for particular events. That is how a one-to-many relationship works. You may want to refer to any of the online tutorials available on basic database design, which is a prerequisite to making good Rails applications.

The scheduling of these recurring tasks is also important. You want to use a cron-style scheduler that does not load the Rails environment every time the task is run. If you use the unix/linux cron to schedule a Rails runner, it will load the Rails environment every time you call the task, which is a very slow and demanding process. My application uses Backgroundrb. I am looking for alternatives which are simpler, easier to debug, and more reliable.

Can you file a bug report with a reliable way to reproduce them? I am sure we can iron it out.

Hemant Kumar wrote:

Can you file a bug report with a reliable way to reproduce them? I am sure we can iron it out.

OK Hemant, I will try to do it as constructively as possible. You have been most helpful. Just the thought of debugging Backgroundrb requires a lot of determination and will-power. Is there a simple guide to debugging a malfunctioning Backgroundrb worker anywhere? I searched long and hard on the web and did not come up with much. If you could publish a series of tutorials on debugging, that would be most helpful. The Advanced Rails Recipes book recipe is of no use now since the API has changed.

I have done everything by the book and my app still produces obscure debugging messages in the backgroundrb_debug.log files, though things seem to be working. Some decent troubleshooting guides are what the community sorely needs.

Regards,

Bharat

Okay, I will see what can be done about this. In the meanwhile, you can upgrade to the 1.1 version, where debugging is more sanitized.

First of all, this is not an easy task.

There are two main parts to fulfilling this task: gathering the users who need to be reminded, and sending the actual reminders.

Depending on your application's data structure, the selection of users should be done by the database engine, as database engines are built for that kind of task. You should build an appropriate SQL statement, optimize indexes for that query, and try to get as much performance out of it as possible.

Sending the reminders is a different issue, and I would probably use a set of daemons to perform it.

However, there are some additional issues you have to address.

For example:

- What if, for any reason, the reminders are not sent at the appropriate time? Should you ignore the unsent reminder or not?
- How would you handle an increased number of reminders if your system cannot send them all within one cycle? How will you detect such a situation?
- If you use multiple threads to send the reminders, how will they share the same cache of selected users?

This is one of those interesting problems that arise from a seemingly easy new feature like sending reminders to users.

If we had more information, perhaps our observations could be more concrete.