GSoC Project Proposal: Scalable Database Interaction

Hello all,

I was hoping to get your suggestions and comments on my GSoC proposal. Let me know what you think!

Abstract

  Ruby on Rails currently lacks the ability to scale its communication with the database. Multiple instances of a Rails application can be run and queries can be optimized, but after a certain point these alone are not enough. Eventually the limitation of a single database becomes too large a bottleneck to ignore. There are people who realize this and avoid Rails because they would much rather select a tool that handles the problem for them. Ruby on Rails needs a scalability solution built in because all serious web applications grow up, grow larger, and start to bring in more traffic. If Rails does not provide one, those people will go elsewhere; if it does, it can draw in people who previously would never have considered it.

Tell us a little about yourself.

  My name is Allen and I will be graduating from Rochester Institute of Technology this May with a Bachelor of Science in Software Engineering, with minors in Computer Science and Psychology and a concentration in Business. In the fall I will be starting my graduate degree in Computer Science, studying database design and computer learning. After graduate school I want to start my own business, which will develop web applications and most likely use Ruby on Rails.

  I first started programming in 10th grade with Visual Basic, HTML, and JavaScript in a class called Computer Math. This is when I fell in love with programming and started to develop some of my own side projects in class. I competed to join the computer programming team at my school, and every year that we competed against other schools we placed in the top three. In 11th grade I learned C++ in AP Computer Science. In 12th grade I learned Java in AP Computer Science, which had just changed to Java that year. In that same year I also took IT Programming, where I learned both ASP and PHP at the same time and had to develop systems that did the same thing in both languages.

  Throughout my college career I have worked with various languages including the .NET languages, Python, Perl, PHP, Java, C++, XML, Schema, XSLT, JavaScript, and many others. As part of my major we are taught some of the most important concepts of software design, such as design patterns, verification and validation for testing, architecture design, designing distributed systems, and designing information systems.

  I have worked for three companies thus far in the technology industry. The first company was Measurement Specialties Inc., where I wrote a Visual Basic application to interface with a new type of gas pump system that measured gas flow. The application would control the flow of gas through the system, simulate variances in pressure, and take measurements, which were stored in a database so that the data could be evaluated. The second company I worked for was Riverside Regional Medical Center, in the financial department. There I wrote programs that translated data from a multi-dimensional database to a relational database. I also wrote programs to evaluate the data in the database and validate that statements were balanced. The third company, where I currently work part time, is Rochester Software Associates. There I work primarily with a Java-based web server that enables print flow management for large companies. Scalability is a large concern at this company because our application is used by schools and companies that vary in size from about 100 users to 10,000.

What will your availability be to work on this project?

  I will be treating this project like a full time job. At least 40 hours a week will be spent on this project. I will be taking a class over the summer, which will account for 4 hours a week plus homework. However, this will be in addition to the 40 hours spent on the project.

Why do you use Rails? How would you like to see it improve?

  I use Rails because of its simplicity. I also like that it was built on good design patterns that not only allow, but encourage, good design during development. Lastly, I use it because it has a strong community that backs it.

  There are a couple of places I think Rails could improve. One area is JavaScript, where it would be nice if the helpers were unobtrusive. I would also like to see some support for action-specific, controller-specific, and application-specific inclusion of resources like JavaScript and CSS. The last place I would like to see Rails improve is its interaction with databases, which is what I am proposing for my Google Summer of Code project.

  I would like Rails to support a scalable system for interaction with data out of the box. There are a few ways of doing this. One of the simplest ways is to separate tables with little or no relationship into separate databases. Another is to create a master-slave setup where all write actions are directed to a single master and all read actions are directed to one of the slaves. A third option is to have replicated databases for each instance of a Rails application, which is already possible in Rails. A fourth, probably the most difficult but arguably the most scalable, method is horizontally partitioning databases in a shared-nothing approach (sharding). Each of these solutions has benefits and limitations, and each is applicable to different problems.

  Though I believe Rails should support these solutions for scalability, they should not be on by default. Scalability should be done as needed; otherwise there would be a lot of unnecessary overhead in setting up an application. If someone were to develop in Rails and their site never experienced scalability issues, they should never need to know about the various scalability options.
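  As a rough illustration of the first approach (unrelated tables living in separate databases), here is a minimal sketch in plain Ruby. Everything in it is an assumption made for illustration: the YAML layout with `default` and `connections` keys, the `ConnectionRegistry` class, and the `uses_connection` class method are hypothetical names, not existing Rails APIs. `Model` is a stand-in, not `ActiveRecord::Base`.

```ruby
require "yaml"

# Hypothetical configuration: named connections plus a default.
# This layout is an assumption for illustration, not Rails' database.yml format.
CONFIG = YAML.safe_load(<<~YML)
  production:
    default: main
    connections:
      main:
        adapter: mysql
        database: app_main
      reporting:
        adapter: postgresql
        database: app_reports
YML

# Hypothetical registry that resolves a connection name to its settings.
class ConnectionRegistry
  def initialize(env_config)
    @default = env_config.fetch("default")
    @connections = env_config.fetch("connections")
  end

  # Returns the settings for the named connection, or the default when name is nil.
  def lookup(name = nil)
    @connections.fetch((name || @default).to_s)
  end
end

# Minimal stand-in for a model base class (not ActiveRecord::Base).
class Model
  class << self
    attr_accessor :registry

    # Bind this model class to a named connection (hypothetical API).
    def uses_connection(name)
      @connection_name = name
    end

    def connection_config
      Model.registry.lookup(@connection_name)
    end
  end
end

Model.registry = ConnectionRegistry.new(CONFIG.fetch("production"))

class Article < Model; end        # no binding, so it uses the default connection

class AuditLog < Model
  uses_connection :reporting      # bound to a connection by name
end

puts Article.connection_config["database"]   # app_main
puts AuditLog.connection_config["database"]  # app_reports
```

  A model that never declares a binding never has to know that more than one connection exists, which matches the goal of keeping these options invisible until they are needed.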

Why is this important to the Rails community at large? Why is this important to you?

  Ruby on Rails has a strong community. However, because there is no support for scalability when interacting with data, there are many people who are hesitant to use Rails. People who want to build large-scale, heavy-traffic websites are reluctant to invest development effort into a framework that does not fully support their goals. There are plugins that enable various scalability features, but enabling them tends to break something else, such as tests that use fixtures. Developers are also frightened away from developing more in-depth solutions because doing so would require monkey patching, which could potentially break things and would also mean another feature they have to maintain for each new version of Rails. If Rails had this type of support built in, the Rails ecosystem could grow larger and gain more support from the people who avoid Rails for these reasons. Some of the people who avoid Rails are large companies. If Rails had the support of large companies, the community would grow. In addition, if large companies became members of the Rails community, there would be money behind developing for Rails, which could lead to great new features and plugins.

  This is important to me because I use Rails. If the Rails community grows, it will inherently make developing in Rails easier for me. In addition, I want to open a business developing web applications, which have a very good chance of running into scalability issues and needing features like the ones I would develop. Also, from a purely selfish point of view, I enjoy the feeling I get when helping others, and it is very satisfying to know that you have affected the lives of many people.

List a clear set of goals/milestones you'll hit during the summer. Be specific.

  I am planning three milestones for this summer. Each milestone will be a completed solution which could be merged into Rails edge. Every milestone after the first will use the previous milestones as a basis. In addition to what is listed below, each milestone will include requirements elicitation from the Rails community and documentation on how to enable and set up each feature. I will also be testing various configurations of multiple databases to ensure the features work.

  The first milestone will focus on handling multiple database connections and handling tables in multiple databases. This will include syntax for declaring multiple databases for development, testing, and production in database.yml. This syntax will include a way to name the connections and specify which connection is the default. This will allow connections to be specified in models by name, and any model without a specified connection will use the default connection. Ideally, both fixtures and migrations will use the connection specified in the model; however, this may not be possible, and the connection may need to be specified in them.

  The second milestone will focus on enabling a master-slave configuration for databases. Previous work with multiple connections will be used to enable this feature. A new syntax will be added to database.yml to allow the declaration of a master and slaves. It will also be possible to combine a master-slave setup with the model binding to a connection. In this case a master-slave group of connections will be named, rather than just a single connection. All write actions will be routed to the master and all read actions will be routed to a slave. I am not sure how load balancing between slaves will work, so I will get feedback from other developers on how it should work. I imagine the load balancing will be something that people may want to implement themselves for their specific setup, so a default configuration will be selected, but it will be easy to override to allow different implementations. Fixtures and migrations will be updated to work in this new setup, but not much additional work will need to be done since writing is always handled by the master.

  The third milestone is the most difficult. It will focus on database sharding (shared-nothing). There are many choices for how to implement this, and I will rely heavily on the community when deciding how to implement this feature. Like the other two milestones, there will be some method for declaring connections to be used as shards. There will be a way to specify models as being sharded. This will likely include declaring models global, for common static lookup tables for types that should be replicated between shards. There will be a way to specify how a model is sharded. This feature will also support some kind of balancing for when new shards are added. Lastly, fixtures will be updated to support this new feature.
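  To make the second milestone more concrete, here is a minimal sketch of read/write routing with a replaceable load balancer. It assumes a default round-robin policy and classifies statements by their leading SQL keyword, which is a simplification. `RoundRobinBalancer`, `ReadWriteRouter`, and the `next_slave` method are hypothetical names chosen for this example, not part of Rails.

```ruby
# Hypothetical default balancer: cycles through the slaves in order.
class RoundRobinBalancer
  def initialize(slaves)
    @slaves = slaves
    @index = 0
  end

  def next_slave
    slave = @slaves[@index % @slaves.size]
    @index += 1
    slave
  end
end

# Hypothetical router: writes go to the master, reads go to a slave chosen by
# the balancer. Any object that responds to #next_slave can be passed in
# instead of the default, which is the "easy to override" part.
class ReadWriteRouter
  # Simplification: a statement is a write if it starts with one of these keywords.
  WRITE_STATEMENT = /\A\s*(INSERT|UPDATE|DELETE|REPLACE|CREATE|ALTER|DROP)\b/i

  def initialize(master:, slaves:, balancer: RoundRobinBalancer.new(slaves))
    @master = master
    @balancer = balancer
  end

  # Returns the name of the connection that should run the statement.
  def route(sql)
    sql.match?(WRITE_STATEMENT) ? @master : @balancer.next_slave
  end
end

router = ReadWriteRouter.new(master: "master", slaves: ["slave_a", "slave_b"])
puts router.route("SELECT * FROM articles")                      # slave_a
puts router.route("INSERT INTO articles (title) VALUES ('x')")   # master
puts router.route("SELECT * FROM comments")                      # slave_b
```

  A real implementation would also have to deal with transactions, locking reads, and replication lag, none of which this sketch attempts.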

Give a rough timeline for hitting these milestones.

  4/10/09 – 5/22/09 – Community bonding
  5/23/09 – 6/19/09 – Milestone 1 – Multi-database connections, model binding to connections
  6/20/09 – 7/10/09 – Milestone 2 – Master-slave
  7/11/09 – 8/10/09 – Milestone 3 – Sharding
  8/11/09 – 8/17/09 – Code cleanup, finalize documentation

How will you measure progress? How will you handle falling behind?

  At the beginning of each milestone I will determine a prioritized list of things that need to be added and things that would be nice to have. From this list I will scope out a projected schedule of when I need to complete each item. From this schedule I will be able to gauge my progress. If I start to fall behind, I will start cutting the features that are not required. If I fall drastically behind, I will still complete all the needed features for that milestone and push back the date of the following milestone. It is quite possible that I will not complete the third milestone even if I complete the other two on time. In that case I hope to at least have a strong basis for myself or someone else to continue and finish after Google Summer of Code is completed.

What are the "unknowns" in this project for you? What kind of pitfalls could you run into?

  I have not worked with Rails internals before. However, I know how to program in Ruby and am very familiar with the design patterns and practices that Rails is built upon. I have not worked with load balancing and do not know how I might implement such a feature for the slaves in the master-slave configuration. I think that my sponsor would be able to assist me in this, though, and I can likely find adequate information about it on the internet and in books. Some of the pitfalls I could fall into would be trying to do too much, falling behind on my project, and missing important requirements. The first two risks can be mitigated by ranking features by importance and planning a schedule, which I discussed in the previous section. The last can be mitigated by rigorous testing, to ensure the feature works completely and as expected, and through regular communication with my sponsor and the Rails community.

Hi Allen,

First of all, forgive me my ignorance about databases and scalability.

The main question I have is: should handling multiple connections really be in rails/ActiveRecord?

I imagine most of the work (routing queries, maintaining connections, proxying the results) could be done on the driver/adapter side and the current ActiveRecord implementation could be left as it is.

By driver/adapter I mean either an ActiveRecord adapter or an external native library with an AR adapter interface.

If at all, hints about which table a query is for could be passed to the driver, if parsing the queries on the driver side is too big of a problem. The whole configuration IMHO should be on the binary driver side anyway to make use of database-specific support for scalability related features on a case-by-case basis.

It kind of seems like pulling the configuration of vendor specific options into rails, when this can all be handled by a special driver, one that might have its own yaml file for a richer set of configuration options that are overhead, as you mentioned.

If an external driver/adapter is created, it can be used in any DB related project (relational ones too), not just rails projects, and not just with one DB vendor at a time. This could also allow custom fine-tuning by anyone without patching rails.

Since this is only relevant for production environments, this will require special database setups that rails wouldn't really be able to automate to begin with.

My guess is putting this into rails might break a few things very quickly, e.g.:
- migration
- associations
- synchronizing data at ORM level

What I do see as useful to change in Rails for the problems you mentioned:
- passing the model table along with the query to the "aggregating/routing" driver, so you don't have to extract it from the query.
- creating an adapter for this driver that will additionally support table->connection options and pass them to the driver

The driver could just be an adapter that makes use of other ActiveRecord adapters - which will keep the fun of writing everything in Ruby.
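A minimal sketch of that idea, assuming a table hint is passed alongside each query so the SQL never has to be parsed. `RoutingAdapter`, its `table_map` option, and the `table:` keyword are hypothetical names for illustration only; `FakeAdapter` stands in for a real ActiveRecord adapter and only records what it was asked to run.

```ruby
# Stand-in for a real adapter; it only records what it was asked to run.
class FakeAdapter
  attr_reader :name, :executed

  def initialize(name)
    @name = name
    @executed = []
  end

  def execute(sql)
    @executed << sql
    "#{name}: #{sql}"
  end
end

# Hypothetical aggregating/routing adapter: same #execute entry point, but it
# accepts a table hint and delegates to the adapter configured for that table.
class RoutingAdapter
  def initialize(default:, table_map: {})
    @default = default
    @table_map = table_map   # table name (String) => adapter
  end

  # table is the hint passed along with the query.
  def execute(sql, table: nil)
    adapter_for(table).execute(sql)
  end

  private

  # Tables without an explicit mapping fall back to the default adapter.
  def adapter_for(table)
    @table_map.fetch(table.to_s, @default)
  end
end

main = FakeAdapter.new("main")
logs = FakeAdapter.new("logs")
routing = RoutingAdapter.new(default: main, table_map: { "audit_logs" => logs })

puts routing.execute("SELECT * FROM audit_logs", table: :audit_logs)  # logs: SELECT * FROM audit_logs
puts routing.execute("SELECT * FROM articles", table: :articles)      # main: SELECT * FROM articles
```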

But then again, I might be completely missing the point.

It might even be funny to be able to do:

./script/plugin install db-scalability

:wink:

Cheers

Allen wrote:

Sorry for the late response. I posted one earlier, but it looks like it got moderated and I'm not sure why. Anyway, thanks for the suggestions. It is a big concern for me that I put this into the right place in the rails framework, whether it be directly integrated or handled as an attachment. I would love to hear other people's opinions on this. I think there are pros and cons to both approaches. One of the biggest pros I can think of for it being integrated, though, is that people will be more likely to use it, because it won't have to be maintained separately from rails. That means if I want to upgrade my rails version, I won't be stuck with the version I am on until somebody updates the scalability aspects to support my setup. I think this one thing has a high appeal to companies, especially those who are apprehensive about using rails because of its lack of scalability when communicating with distributed databases. Of course all of your points are valid, and I don't see myself fully making a decision until I've talked to my sponsor and gotten more feedback from the rest of the rails community. You and I are 2 developers out of dozens, and I could implement a solution that fits us (not to say you were suggesting that) but completely miss the requirements of the vast majority of the rails community.

I do have a couple of additional comments on some of the things you mentioned, which I'll state next to your comments below. Forgive me if I sound attacking, I don't mean to. However, I would like to encourage you to attack my ideas. That way only the strongest will survive.

Hi Allen,

First of all, forgive me my ignorance about databases and scalability.

The main question I have is: should handling multiple connections really be in rails/ActiveRecord?

I imagine most of the work (routing queries, maintaining connections, proxying the results) could be done on the driver/adapter side and the current ActiveRecord implementation could be left as it is.

That is very possible. Could you point me to some more information on adapters? I think if I were to go with this approach there would still be a lot of functionality that I could reuse from ActiveRecord, so I think an adapter would be a better approach than a driver. In the case that I do extend ActiveRecord, I think I may need to modify it to allow some of the extensions that would be required. This is purely theoretical though.

By driver/adapter I mean either an ActiveRecord adapter or an external native library with an AR adapter interface.

If at all, hints about which table a query is for could be passed to the driver, if parsing the queries on the driver side is too big of a problem. The whole configuration IMHO should be on the binary driver side anyway to make use of database-specific support for scalability related features on a case-by-case basis.

I think the first two milestones are for the most part database agnostic (I could be wrong). The third milestone may, however, benefit from database-specific support, so maybe this should be externalized into an adapter. I would argue against a binary driver approach only because not all databases support sharding and I think rails can still provide a solution, though I think you are right that database specific stuff should not be in rails.

So I guess what I am suggesting would be somewhat of a hybrid. The first two options would be built into rails. The third option could be a vendor specific adapter or a generic non-vendor specific adapter (probably what I would write). In that case I could write a generic framework for a database sharding adapter, which other people could extend to make database specific adapters.
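A minimal sketch of what such a generic, extensible sharding framework could look like, assuming the only thing a subclass has to decide is how a key maps to a shard. `ShardStrategy`, `ModuloStrategy`, `RangeStrategy`, and `shard_for` are hypothetical names for illustration; a real adapter would have far more to handle than shard selection.

```ruby
require "zlib"

# Hypothetical generic base: subclasses decide how a key maps to a shard.
class ShardStrategy
  attr_reader :shards

  def initialize(shards)
    raise ArgumentError, "at least one shard is required" if shards.empty?
    @shards = shards
  end

  def shard_for(key)
    raise NotImplementedError, "#{self.class} must implement shard_for"
  end
end

# Generic, vendor-neutral strategy: checksum the key, then take it modulo the shard count.
class ModuloStrategy < ShardStrategy
  def shard_for(key)
    shards[Zlib.crc32(key.to_s) % shards.size]
  end
end

# Example of an extension: contiguous id ranges per shard.
# Assumes positive integer ids starting at 1.
class RangeStrategy < ShardStrategy
  def initialize(shards, range_size)
    super(shards)
    @range_size = range_size
  end

  def shard_for(key)
    index = (Integer(key) - 1) / @range_size
    shards.fetch(index) { raise KeyError, "no shard covers id #{key}" }
  end
end

ranges = RangeStrategy.new(%w[shard_0 shard_1], 1000)
puts ranges.shard_for(1)      # shard_0
puts ranges.shard_for(1000)   # shard_0
puts ranges.shard_for(1001)   # shard_1

hashed = ModuloStrategy.new(%w[shard_0 shard_1 shard_2])
puts hashed.shard_for(42) == hashed.shard_for(42)  # true (same key, same shard)
```

One design consequence worth noting: with a plain modulo scheme, adding a shard changes the result for most keys, so rebalancing when new shards are added would need either a data migration step or a different mapping.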

It kind of seems like pulling the configuration of vendor specific options into rails, when this can all be handled by a special driver, one that might have its own yaml file for a richer set of configuration options that are overhead, as you mentioned.

You are right that I don't want to produce overhead, but I was talking specifically about when an application is first created. There is almost a 0% chance that you want to set up scalable database interaction when you first create a rails application. That is why I mentioned that the configuration should be invisible until it is needed (though I may not have done a very good job of it). So it is only when you do need to implement something scalable that the richer set of configuration options need to be used and at that time it won't be overhead, it will be necessary. And again, not to be pedantic, but I don't think this is vendor specific as I stated above. Perhaps you can explain further how you think it is?

If an external driver/adapter is created, it can be used in any DB related project (relational ones too), not just rails projects, and not just with one DB vendor at a time. This could also allow custom fine-tuning by anyone without patching rails.

I do think that would be cool, to have it supportable by more frameworks. However, I think that may be out of the scope of what I am suggesting. What I am really talking about is an intermediary between the actual adapters and rails whose sole purpose is to direct queries to the right place. That may sound simplistic, but determining the right place to send a query is a very complicated task. I hope that makes sense. I explicitly don't want to do something vendor specific; rather, I want it to work for any sort of database. I think the drivers/adapters already available do a much better job than I could at handling vendor specific databases.

Since this is only relevant for production environments, this will require special database setups that rails wouldn't really be able to automate to begin with.

I disagree that this is only relevant to production environments. Despite the work that I would be doing, there would be some configuration that a developer would have to do to get such a complicated setup working. Basically, joining between databases and other things that arise with such a configuration would be suboptimal, and the developer would need to be able to test that functionality to ensure those types of things don't happen.

You are right that rails wouldn't be able to automate this sort of setup, but I would argue that you wouldn't want it to. The type of setup you use would depend on how your data is laid out, and I think it would be a task beyond rails' ability to take the number of databases available and determine an appropriate setup based on how your data is laid out.

My guess is putting this into rails might break a few things very quickly, e.g.:
- migration
- associations
- synchronizing data at ORM level

I absolutely agree with this assertion, but I think this is true of extending the functionality of almost anything. I fully plan to fix these as part of my GSoC project, though.

What I do see as useful to change in Rails for the problems you mentioned:
- passing the model table along with the query to the "aggregating/routing" driver, so you don't have to extract it from the query.
- creating an adapter for this driver that will additionally support table->connection options and pass them to the driver

I definitely would like to pass the model table. I want to keep this as DRY as possible.

Here is where I think our ideas are disconnected, if I am understanding what you're saying correctly. You think the driver should handle the multiple connections and if I were to implement such a thing, it would have to be vendor specific. However, I think it would be much more beneficial if rails handled the multiple connections and submitted queries to the driver specified. I think this approach is more generic and would allow all drivers currently available to be used and still offer scalability options.

The driver could just be an adapter that makes use of other ActiveRecord adapters - which will keep the fun of writing everything in Ruby.

I guess I restated what you said here up above (whoops!).

But then again, I might be completely missing the point.

If you missed the point it's only because I didn't explain myself well enough.

Hi Allen, I read your proposal and it looks interesting. Rails scaling was a very hot topic in the community around last year, I think. Apparently Twitter was having problems scaling with a single database, but a quick solution was found (I think it was a plugin of about 75 lines of code). I don't know the details of what that guy did, but it was the solution to Twitter's scaling issues. You should also consider other big applications that are still scaling using a single database. Basecamp, for example, has more than a million users and I think it runs on a single database.

Taking all those things into consideration can give a better idea of what you need to build. I hope this helps.

Hey Carlos,

Thanks for the input. I have looked into some of Twitter's scaling issues and it seems they suggest a lot of what DHH does in that article.

The article makes a very good point that developers shouldn't be spending time on deploying a scalable database architecture until they need it, which I agree with. What I do think rails can benefit from, though, is something that makes it easy to scale when it is needed. One thing I think it could easily do automatically is the type of denormalization that is talked about in the Twitter presentation. Basically, an auto-magic column that stores the relationship on the has_one, has_many, and has_and_belongs_to_many side of the model relationship. This would make partitioning data substantially easier later on. Then you could do something like Article.find(id, :include => 'comments') without caring if the information is in the same database.

Here's some semi-pseudo code to illustrate what would go on behind the scenes:

    # known, since models are bound to connections
    if Article and Comment are in the same database
      do what rails already does
    else
      # automatically grabs the right connection, because it was bound in the model
      article = Article.find(id)

      # auto grabs the right connection as well
      # uses the autogen'd column and matches against id, which is faster because ids are indexed
      article.comments = Comment.find(:all, :conditions => ['id IN (?)', article.comment_ids])
    end
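
The else branch of the pseudo code above can be sketched in runnable form with in-memory stand-ins. This is only an illustration under stated assumptions: `FakeDatabase` and `CrossDatabaseLoader` are hypothetical names, the "databases" are plain hashes, and the `comment_ids` key plays the role of the auto-generated column. The same-database branch is not modelled; the sketch always takes the two-query path.

```ruby
# In-memory stand-in for one database: table name => { id => row }.
class FakeDatabase
  def initialize(tables)
    @tables = tables
  end

  def find(table, id)
    @tables.fetch(table).fetch(id)
  end

  # Looks rows up by primary key and skips ids that do not exist.
  def find_all(table, ids)
    @tables.fetch(table).values_at(*ids).compact
  end
end

# Hypothetical loader: fetch the parent, then use its denormalized comment_ids
# column to fetch the children from whichever database their table is bound to.
class CrossDatabaseLoader
  def initialize(bindings)
    @bindings = bindings   # table name => FakeDatabase
  end

  def article_with_comments(id)
    article = @bindings.fetch(:articles).find(:articles, id).dup
    article[:comments] =
      @bindings.fetch(:comments).find_all(:comments, article[:comment_ids])
    article
  end
end

articles_db = FakeDatabase.new(
  articles: { 1 => { id: 1, title: "Scaling", comment_ids: [10, 11] } }
)
comments_db = FakeDatabase.new(
  comments: {
    10 => { id: 10, body: "First" },
    11 => { id: 11, body: "Second" },
    12 => { id: 12, body: "Other" }
  }
)

loader = CrossDatabaseLoader.new(articles: articles_db, comments: comments_db)
article = loader.article_with_comments(1)
puts article[:comments].map { |c| c[:body] }.inspect  # ["First", "Second"]
```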

I think there are a lot of instances like this where, behind the scenes from day one, rails can prepare your application to be scaled without you ever being concerned about it. When you do need it, you set up your database configuration, perhaps run a generate task to make a special migration for the type of database setup you have and let rails handle the rest. Of course I'm oversimplifying this a little; this is no small task.