Multiple customers - keeping the data separate - how?

I'm convinced this can be done in a simple and effective manner.

I'm sure it can. Based on your comments you seem to be going for a coarser-grained solution compared to the problem that I'm trying to solve.

Doing this through DNS implies a number of things:

1. You can wait for DNS propagation (assuming you're talking about the Internet and not an intranet).
2. You have the liberty to create a separate application cluster per customer (all using essentially the same code base, with a config per customer).

Assuming this is true for you, then I would agree that you can do this pretty simply.

My situation is different.

I need quick provisioning for new customers (on the order of seconds to a couple of minutes), and I cannot consider automatic provisioning of a new application cluster each time for a new customer.

In my situation I have to be able to share application clusters (running on many machines) with a number of databases on the back end.

To CWK's points:

I'm not claiming that building a multi-tenant system by breaking up the DB by tenant is trivial in my case, but it is also not as dire as you portray it. But I generally agree with you: I would like to have everything in one DB for maintenance, but reality is forcing my hand.

Firstly: the application I'm working on is a live Internet application and is already database-limited. Performance of the middleware is not even remotely a factor. Our Web servers are basically asleep. So my primary concern is scaling the DB layer.

We've already investigated a number of possible cluster/federation schemes and they do not scale nearly as well as the vendors would like you to believe. In our tests data partitioning per customer has given, by far, the best overall performance boost.

My comment about big numbers is this: no matter what your DB solution is, there is some number of aggregate rows in the DB at which performance will diminish "quickly". Generally speaking, you index your data to get better (read) performance, but indexing provides the maximal benefit when the indexes of all your hot tables either fit in RAM or result in very few hits to disk. However, as time goes on fewer and fewer of your indexes will fit in RAM, and even your less-hot tables and indexes become significant. Your DB starts to become disk-bound. "Disk-bound" is one foot in the grave. So what to do? I can't ignore it.
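To put rough, purely illustrative numbers on it (my arithmetic, not a benchmark):

  100 million rows x ~30 bytes per index entry ≈ 3 GB for a single index
  3 hot tables x 2 indexes each ≈ 18 GB of hot index
  ...versus maybe 4-16 GB of RAM on a typical DB box

Once the working set of indexes outgrows RAM like that, every lookup starts paying for disk seeks.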

Right now we have logically partitioned customer data in the same DB. All living cozy inside the same schema. This does make many things easy, but it makes scaling VERY hard.

At large numbers things start to behave very differently as off-the-shelf solutions kind of cease to work, period.

Well, I think that you would agree that this is hyperbole. Off-the-shelf solutions (RDBMSes, you mean) are clearly not the fastest things on the planet, but they keep the data malleable. I'm not aware of alternatives that have both the relative ease of data manipulation of RDBMSes and the reasonably good performance they possess. (Ah, perhaps Google has some goodies in-house; alas, I'm not Google.)

Keeping RDBMSes operating at a healthy level can be done for a long time, but you will eventually need to give up some comfort. In this case the "all-in-one-db" approach.

This is a !@#$-load of tricky plumbing to avoid setting and watching client entitlements on database rows. Let alone the havoc this could wreak with managing the database--depending on which one you use, this could complicate how you deal with tablespaces and such.

No doubt there is some plumbing that needs to be put into place. But I'm doing this primarily to gain performance. I gain the performance on two levels: the DB server in question is operating on relatively "small" DBs, meaning intrinsically improved performance, and then there are logical performance improvements I can make. Right now I have to do security checks for the "owners" as well as other users (our application allows clients to publish their data to other users of our system as well as the general public). Once an owner (or assistants, who have the same privileges as the owner) has logged in, I need to perform no checks on access to their data.
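To illustrate the difference (a sketch with hypothetical models; this is not our actual code):

# With one shared DB, every query has to carry the entitlement check:
@projects = Project.find(:all,
  :conditions => ["owner_id = ? OR published = ?", @owner.id, true])

# With one DB per customer, the connection itself is the scope, so once
# the owner has authenticated, a plain find is already safe:
@projects = Project.find(:all)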

> - Incremental application migration

This is beneficial if you want to maintain multiple software versions. If that's the case then you might as well just install complete app instances per client and be done with it. Been there many times, will never do it again unless building something like an ERP where it still makes some sense.

Agreed, this can be a pain. I wasn't referring to maintaining an arbitrary number of versions of the app, just no more than two at once: old and new. This situation would only be temporary, while a roll-out was occurring.

> - Overall better performance

TANSTAAFL. If your database server is running twenty database instances, there is going to be some kind of performance hit to that versus one DB with tables 20 times larger. The overhead associated with connection pools, query caches, et al. could in many cases be much larger than the hit from scanning tables 20 times longer. I just don't accept this as an open-and-shut benefit right off the bat.

The benefit comes from the fact that I can run one or more DB instances on a given DB server. How I want to tune performance is entirely up to me. In fact, whatever tuning trick you can do in a single DB instance to gain performance I can also do with a one-DB-per-customer setup, but the converse is not true: there are things that can be done in a one-DB-per-customer configuration that cannot be done in a single-DB approach. I'm not saying that I can accomplish this easily, but it can be done.

> - The ability to manage performance better (one big hot client can be moved to their own DB server)

There's no reason you can't do this with a multi-tenant system too. For that matter you can run a special client on their own complete system instance with no or very little fancy plumbing.

In my case the DB, not the web application, is the problem. Otherwise, yes, I agree with you.

> The DB dumper is something that has to be maintained! You're not getting out of the fact that you will have to do work to make it seem like each tenant is an island.

Snap response: implement some kind of to_sql method which can be called recursively through the object tree, starting with the root object representing a client. For all I know facilities for this already exist within ActiveRecord, which after all has to know how to generate SQL. Or just serialize stuff into YAML, or something like that.
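A minimal sketch of the YAML variant (hypothetical Client/Project/Task models and associations; AR objects serialize for free since Hash#to_yaml ships with Ruby's standard library):

require 'yaml'

# Walk the object tree from the root object representing a client and
# dump everything to a YAML string.
def dump_client(client_id)
  client = Client.find(client_id)
  projects = client.projects.map do |project|
    data = { 'attributes' => project.attributes }
    data['tasks'] = project.tasks.map { |task| task.attributes }
    data
  end
  { 'client' => client.attributes, 'projects' => projects }.to_yaml
end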

Not to mention that you may find (as I did) that clients want/like human-readable backups, not SQL dumps.

But with a per-client DB approach *I* get the ability to back up and restore data on a per-client basis, a far more regular occurrence. And I can do this using high-performance tools without writing anything (except my app to work like this). As for the human-readable part, I could also write such a script. Also, in case you haven't tried it: serializing with Ruby is REAL SLOW, and loading the data with Ruby is no race winner either. Clearly no one has to be confined to using Ruby to do this. But yet again the per-client DB approach wins for flexibility out of the box.

I still think the "not trivial" aspect understates it by two-thirds.

Fair enough.

It seems to me like you're building a unique configuration that will end up having a lot more dependencies on the versions of the framework, O/S, database config, etc. than is obvious from this vantage point. The end result could be that every time you do a major rev of any piece, you risk the whole thing falling apart and being the one guy in the world with that specific problem, and needing to stay on MySQL 3.1 for a year after its release until the low-priority bug gets fixed. Yeah, I know it's a hypothetical, but it's the kind of hypothetical that's bitten me in the rear multiple times. The all-in-one approach has been by far the easiest to maintain and operate of all the approaches I've been involved with.

A possibility, but I generally doubt it.

Well like I said above I agree that it poses certain challenges--you end up needing to build a high-performance application even though all your customers are 5-seat installations. I do agree that this is ultimately probably an issue best solved in the database, but I'm not sure that the approach posited here isn't trading getting stabbed for getting shot.

In my case (unlike Neil) I'm doing this explicitly for the performance benefits I can attain.

Your warnings have been heard, I will take them into account. Much appreciated.

Jim Powers

Neil Wilson wrote:

Phlip wrote:

> So you have wall-to-wall unit tests, right?

No, and neither do you. Even if you think you have :wink:

I can add an 'assert(false)' to any block in my program, and tests will get to it.

(The remaining useless debate centers on a useful definition for "wall to wall".)

The more security you want, the more tests you need.

We're actually planning on doing something exactly like this. In our case, each tenant represents an institution, with up to hundreds of users. We're not concerned about quick provisioning of new tenants - signing one up and migrating them is a large, manual process regardless. Having one db, one OS user, and one domain name per tenant simplifies *a lot*.

* Frees the code from having to track tenants. Otherwise, every row would need a tenant_id, and *every* find would have to scope to tenant_id
* Built-in, bulletproof data partitioning
* Ability to move tenants to separate servers, or let them host their own

Implementation is straightforward:

* All tenants run off the same codeline
* The codeline is checked out once - to one place
* Each tenant has their own environment - which is identical except for the db - and their own domain name, which is tenantsname.ourapp.com
* Deployment / new versions are run by script against all the dbs / mongrels
* One mongrel per tenant - all running off the same code dir - but in a different env, and a different db
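To make that concrete, a deployment-script sketch (the tenant names, ports and pid paths are made up, and mongrel_rails is assumed to be on the PATH):

# Start one Mongrel per tenant: same code directory, but a different
# Rails environment, port and pid file for each.
TENANTS = %w[acme globex initech]
BASE_PORT = 4000

TENANTS.each_with_index do |tenant, i|
  system("mongrel_rails start -d" +
         " -e production_#{tenant}" +     # per-tenant environment
         " -p #{BASE_PORT + i}" +         # per-tenant port
         " -P tmp/pids/#{tenant}.pid")    # per-tenant pid file
end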

Now, the only thing which really concerns me is the fact that we're stuck with one Mongrel per tenant. With a lot of tenants, and each Mongrel using 20-50MB of memory, that could get ugly. It's possible that the swap file will handle all of this - swapping out Mongrels belonging to tenants that aren't online - but this won't help much, as during peak times nearly every tenant will be using the system.

One very simple solution to this would be to mod ActiveRecord to not use persistent database connections. This could be something as simple as an around_filter, establishing a connection to the appropriate db and tearing it down afterwards. This would let all of our Mongrels be used for any tenant.
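A rough sketch of that filter (untested; it assumes per-tenant entries in database.yml named production_<tenant>, and a Rails version whose around filters can yield to the action):

# In ApplicationController: open a connection for the tenant at the
# start of the request and hang up at the end.
class ApplicationController < ActionController::Base
  around_filter :with_tenant_connection

  private

  def with_tenant_connection
    # Tenant identified by subdomain, e.g. acme.ourapp.com
    tenant = request.subdomains.first
    ActiveRecord::Base.establish_connection(
      ActiveRecord::Base.configurations["production_#{tenant}"])
    yield
  ensure
    ActiveRecord::Base.remove_connection
  end
end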

Persistent db connections aren't necessarily that helpful, anyway. With MySQL, for instance, on a LAN, the savings are hardly noticeable. I know that for SQL Server, MS stopped recommending them as well, their feeling being that if you have a lot of apps on one db, it's better for each one to hang up when it's done. Better the overhead of connect/teardown than of keeping numerous dormant connections.

Another possible concern I have is session collision - although I'm not sure if this is even possible - I need to investigate the different ways Rails handles sessions.

Last, we've already written a little code to help with some of the unique db issues - enforcing that only one tenant ever uses one db.

Neil, if you or anyone else is interested in collaborating to help make the scripts and tools needed to make this a reality, please speak up. (Please keep posts to the list, not private email.)

Coming in a bit late here... This is an issue we have had for quite a while, as we store financial data and it absolutely cannot get mixed up. IMO this is one area where some logic should go in the database, and the easiest solution is using a database that gives you the right tools. You can absolutely keep client data separate and have it all in one database and normalized by using functions and views, at least with databases like PostgreSQL and Oracle. We make a few adjustments here and there, such as not being able to use some of the AR methods for inserts and updates, but a small library of custom methods is a whole lot easier than having hundreds of databases.
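For flavor, here is one way the view trick can look with PostgreSQL (an illustrative sketch, not our schema; custom settings like this may need to be declared in postgresql.conf depending on the PostgreSQL version):

# Create a view that only exposes the current client's rows, keyed off
# a per-connection setting (run once, e.g. in a migration):
ActiveRecord::Base.connection.execute(<<-SQL)
  CREATE VIEW my_accounts AS
    SELECT * FROM accounts
    WHERE client_id = current_setting('myapp.client_id')::int
SQL

# At the start of each request, pin the connection to the client
# (client_id here is a hypothetical helper):
ActiveRecord::Base.connection.execute("SET myapp.client_id = '#{client_id}'")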

A bigger issue is proper testing and good change-management habits. Most bugs I see in working production systems appear when some developer gets the itch to upgrade something or fix an existing bug and pushes it into production without adequate testing. The other leading cause is making too many changes over too short a time period. If reliability and data integrity are at the top of your list, then the fact is you just have to be more conservative with how often you change stuff or add new features. You can have the best system in the world for keeping your client data separate, but if your people have bad habits it won't matter.

Chris

I can understand the desire to try and get the Mongrel count down, but the worry I have with reusing Mongrels is that the ObjectSpace is potentially polluted with ActiveRecord data from a previous tenant. I don't want to take on the added complexity of database separation and then find that the separation has broken down because I'm recycling ObjectSpaces and there is a cyclic graph in my object hierarchy keeping old AR instances out of the clutches of the garbage collector.

I see this as akin to an operating system. Let's get processes working and see how they handle things before we invent threads. It may be that Moore's Law rides to the rescue again.

S. Robert James wrote:

Neil, if you or anyone else is interested in collaborating to help make the scripts and tools needed to make this a reality, please speak up. (Please keep posts to the list, not private email.)

I want to see how this works. Let's build it, but let's build the simplest thing that will work first - total separation and a separated tenancy provisioning system.

NeilW

Neil Wilson wrote:

I see this as akin to an operating system. Let's get processes working and see how they handle things before we invent threads. It may be that Moore's Law rides to the rescue again.

I want to see how this works. Let's build it, but let's build the simplest thing that will work first - total separation and a separated tenancy provisioning system.

Agreed. Consider the project started. And with the motto "make the simplest thing that could possibly work".

I think the first task is to expand Capistrano so you can tell it to run one task for a list of environments. Migrate all the environments, restart all the mongrels, take 'em all down.

SCM check outs remain the same - we'll use one SCM branch for all the instances.

I'm also working on a simple tool for cron / daemon jobs - again, one cmd to start/stop them all for all of the environments.
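Something along these lines, perhaps (a sketch as a Rake task; the tenant names are placeholders):

# Drive Capistrano across every tenant environment with one command.
TENANTS = %w[joe fred bob]

namespace :tenants do
  desc "Migrate and restart every tenant environment"
  task :update do
    TENANTS.each do |tenant|
      sh "cap -s user=#{tenant} -a update"    # deploy/migrate this tenant
      sh "cap -s user=#{tenant} -a restart"   # bounce its mongrels
    end
  end
end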

You can deal with a lot of your application security by just using associations correctly.

A before filter sets a user object based on the value in session.

@user = User.find session[:user]

In ProjectController, the list method is something like this

def list
  @projects = @user.projects
end

There's simply no need to worry about screwing up the relationships as long as you track what user owns things. When you save the data, make sure you save the owner of that record in every table, and then let the relationships work themselves out.

Worried about extra database hits? Then use eager loading where appropriate. Use a before_filter for the project controller that eager loads the projects for a user. Maybe even load more stuff. Or create methods on the user object to do your loading.
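For example (a sketch; the tasks association is made up):

# Eager-load a user's projects and their tasks in one pass instead of
# one query per association.
class ProjectController < ApplicationController
  before_filter :load_projects, :only => [:list]

  def list
  end

  private

  def load_projects
    @user = User.find(session[:user])
    @projects = @user.projects.find(:all, :include => :tasks)
  end
end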

Roderick van Domburg wrote:

I agree. I've seen two approaches:

1. Perform scoping in-database by using views and triggers. A stored procedure is used to set up the views for the specific customer or user.

2. Perform scoping in the application. We've been using around_filter in Rails to wrap entire controllers in a with_scope. However, reading recent threads on Rails-core, with_scope will go protected, which will make this approach extremely impractical.

Seeing how my idea of going about option #2 is going to be deprecated in Rails, I share your curiosity as to what _is_ the optimal solution. Starting every single action with a with_scope may be traceable, but the repetition seems greatly inefficient.

Just keep on doing. You don't have to agree with the core - you can just send(:with_scope, params).

But, even better: it's protected, not deprecated. Define a method with_scope_for_user(user) in your model, mark it public, and have it call with_scope. That's much better anyway.
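Something like this (a sketch; the user_id condition is illustrative):

class Project < ActiveRecord::Base
  # Public wrapper around the now-protected with_scope.
  def self.with_scope_for_user(user, &block)
    with_scope(:find => { :conditions => ["user_id = ?", user.id] }, &block)
  end
end

# Every find inside the block is scoped to the user:
Project.with_scope_for_user(@user) do
  @projects = Project.find(:all)
end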

nuno wrote:

Maybe the solution would be to use a virtualization system? Like the one that is available under Linux (Xen).

I see Xen as part of the solution, but not in the way that you imagine.

NeilW

Cw K. wrote:

Part of me doubts whether there is a good generalized approach to this at the framework level.

Does a tenant ever need to see another tenant's data in a manner that couldn't be achieved simply by giving an individual a user id in both tenants' user lists?

You see, I still see the user list, group list, access control lists and authentication/authorisation role system as living within the application space. You have to do that, and the structure is indeed different and evolving for every application there is.

But the tenant can be moved out to the framework level, 'cos a tenant is just a good old-fashioned user at the infrastructure level, and half the job is already done by the standard Unix user tools.

You've got to admit that

rake remote:exec ACTION="invoke" COMMAND="adduser new_tenant" SUDO="yes"
cap -s user=new_tenant -a cold_deploy
rake remote:exec ACTION="invoke" COMMAND="invoke-rc.d apache2 reload" SUDO="yes"

has a certain succinct charm to it. I wonder how close to this ideal I can get and how much it costs in real terms?

NeilW

S. Robert James wrote:

Agreed. Consider the project started. And with the motto "make the simplest thing that could possibly work".

I think the first task is to expand Capistrano so you can tell it to run one task for a list of environments. Migrate all the environments, restart all the mongrels, take 'em all down.

That depends how you separate the tenants. If you make a tenant a Unix user, then the job is (potentially) trivial:

for word in `cat list_of_tenants`; do cap -s user=$word -a update; done
for word in `cat list_of_tenants`; do cap -s user=$word -a restart; done

I'm also working on a simple tool for cron / daemon jobs - again, one cmd to start/stop them all for all of the environments.

Again, in theory, if you make a tenant a Unix user, then the cron jobs all run in the user's crontab in the user's space, and so do all the daemons for that tenant. So restarting them just needs a dose of 'killall' and a script run as the correct user.

You can use the @reboot facility of cron to bring the Mongrels up for a tenant when the machine starts, and a daily cron entry to restart them to keep memory under control.
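For example, each tenant's crontab might contain something like (the paths are illustrative):

@reboot /home/tenant/current/script/start_mongrels
# restart daily to keep memory under control
30 4 * * * /home/tenant/current/script/restart_mongrels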

I like the idea of tenant = Unix user. It has a certain conceptual charm to it, and if I can make it work it gives me a ton of leverage from the base Unix tools.

Barking?

NeilW

Two great articles discussing exactly this: http://msdn2.microsoft.com/en-us/library/aa479086.aspx http://msdn2.microsoft.com/en-us/library/aa479069.aspx

Neil Wilson wrote:

One thing I would add to this is that even when using separate databases or schemas, it pays to design your tables as if the data was all in one database/schema.

Also, as an FYI for those that are interested: we spent a good amount of time working on different ways to use Rails in an environment where user data was separated by schemas. One thing that's worked fairly well is the set_table_name method, which can be used to set the schema.tablename at the start of each request. At a slight hit in performance we actually do something like the following:

- Start of request
- set_table_name 'schema.table'
- Do stuff
- set_table_name 'none'
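In Rails terms that might look roughly like this (a sketch; the model list, the schema naming convention, and a Rails version whose around filters can yield are all assumptions):

class ApplicationController < ActionController::Base
  around_filter :scope_tables_to_tenant

  private

  def scope_tables_to_tenant
    schema = "tenant_#{request.subdomains.first}"    # e.g. "tenant_acme"
    [Account, Project].each do |model|
      model.set_table_name "#{schema}.#{model.table_name}"
    end
    yield
  ensure
    # Point the models back at the plain table names afterwards.
    [Account, Project].each do |model|
      model.set_table_name model.table_name.split('.').last
    end
  end
end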

V. Interesting. Thanks for that.

BTW You'll be glad to hear that the Multi-tenant system is progressing (at snail's gallop, but at least it's moving forward). I have a brittle proof of concept up on a Debian Etch Xen platform.

One of the interesting side effects of using Capistrano to deploy code once per tenant is that file system sessions suddenly scale rather well.

Since multi_tenant is built entirely as a set of Capistrano recipes and plugins I'll probably run any posts on the Capistrano group rather than here - where it may get lost in the noise.

Stay tuned

NeilW

Neil - I've tried emailing you directly but the messages bounce - could you email me? I think we may be able to collaborate here.

Neil Wilson wrote:

One thing I've done to keep DRY:

# environment.rb
require 'config/tenants'

# tenants.rb
PRODUCTION_TENANTS = ['joe', 'fred', 'bob']

# database.yml
<% PRODUCTION_TENANTS.each do |tenant| %>
production_<%= tenant %>:
  adapter: postgresql
  database: <%= tenant %>
  username: <%= tenant %>
  password: useasharedpasswordforalltenants
  host: localhost
<% end %>

One hitch I've had is that Rails wants to load config/environments/(environment name).rb and crashes if it can't. I'd rather use the same production.rb for all of them. Any ideas?
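(One idea I haven't tried yet: generate a tiny per-tenant stub file that just pulls in the shared config, something like

# config/environments/production_joe.rb - generated per tenant
eval(IO.read(File.join(File.dirname(__FILE__), 'production.rb')), binding)

but I don't know if that plays nicely with the initializer.)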