GSoC Project Proposal: Scalable Database Interaction

Hello all,

I was hoping to get your suggestions and comments on my GSoC proposal.
Let me know what you think!

Abstract

  Ruby on Rails currently lacks the ability to scale its communication
with the database. Multiple instances of a Rails application can be run
and queries can be optimized, but after a certain point these alone
are not enough. Eventually a single database becomes too large a
bottleneck to ignore. Some people realize this and avoid Rails because
they would rather select a tool that handles the problem for them.
Ruby on Rails needs a scalability solution built in because all serious
web applications grow up, grow larger, and start to bring in more
traffic. If Rails does not provide one, those people will go elsewhere;
if it does, it can draw in people who previously would never have
considered it.

Tell us a little about yourself.

  My name is Allen and I will be graduating from Rochester Institute of
Technology this May with a Bachelor of Science in Software
Engineering with a minor in Computer Science and Psychology and a
concentration in Business. In the fall I will be starting my graduate
degree in Computer Science, studying database design and computer
learning. After graduate school I want to start my own business, which
will develop web applications and most likely use Ruby on Rails.
  I first started programming in 10th grade with Visual Basic, HTML,
and JavaScript in a class called Computer Math. This is when I fell in
love with programming and I started to develop some of my own side
projects in class. I competed to join the computer programming team at
my school and every year that we competed against other schools we
were always in the top three. In 11th grade I learned C++ in AP
Computer Science. In 12th grade I learned Java in AP Computer Science,
which had just changed to Java that year. In that same year I also
took IT Programming where I learned both ASP and PHP at the same time
and had to develop systems that did the same thing in both languages.
  Throughout my college career I have worked with many languages,
including the .NET languages, Python, Perl, PHP, Java, C++,
XML, Schema, XSLT, JavaScript, and many others. As part of my major I
have been taught some of the most important concepts of software
design, such as design patterns, verification and validation for
testing, architecture design, designing distributed systems, and
designing information systems.
  I have worked for three companies thus far in the technology
industry. The first was Measurement Specialties Inc., where I wrote a
Visual Basic application to interface with a new type of gas pump
system that measured gas flow. The system would control the flow of
gas through the system, simulate variances in pressure, and take
measurements, which were stored in a database so the data could be
evaluated. The second was Riverside Regional Medical Center, in the
financial department, where I wrote programs that translated data from
a multidimensional database to a relational database. I also wrote
programs that evaluated the data in the database to validate that
statements were balanced. The third company, where I currently work
part time, is Rochester Software Associates. There I work primarily
with a Java-based web server that enables print flow management for
large companies. Scalability is a large concern there because our
application is used by schools and companies that vary in size from
about 100 users to 10,000.

What will your availability be to work on this project?

  I will be treating this project like a full time job. At least 40
hours a week will be spent on this project. I will be taking a class
over the summer, which will account for 4 hours a week plus homework.
However, this will be in addition to the 40 hours spent on the
project.

Why do you use Rails? How would you like to see it improve?

  I use Rails because of its simplicity. I also like that the design
patterns it uses not only allow but encourage good design during
development. Lastly, I use it because it has a strong community
that backs it.
  There are a few places I think Rails could improve. One area is
JavaScript, where it would be nice if the helpers were unobtrusive. I
would also like to see some support for action-specific,
controller-specific, and application-specific inclusion of resources
like JavaScript and CSS. The last place I would like to see Rails
improve is its interaction with databases, which is what I am proposing
for my Google Summer of Code project.
  I would like Rails to support a scalable system for interaction with
data out of the box. There are a few ways of doing this. One of the
simplest ways is to separate tables with little or no relationship
into separate databases. Another is to create a master-slave setup
where all write actions are directed to a single master and all read
actions are directed to one of the slaves. A third option is to have
replicated databases for each instance of a Rails application, which
is already possible in Rails. A fourth and probably most difficult but
arguably the most scalable method is horizontally partitioning
databases in a shared-nothing approach (sharding). Each one of these
solutions has benefits and limitations and each is applicable to
different problems.
  Though I believe Rails should support these solutions for
scalability, they should not be on by default. Scalability should be
done as needed; otherwise there would be a lot of unnecessary overhead
to setting up an application. If someone were to develop in Rails and
their site never experienced scalability issues, they should never
need to know about the various scalability options.

Why is this important to the Rails community at large? Why is this
important to you?

  Ruby on Rails has a strong community. However, because there is no
support for scalability when interacting with data there are many
people who are hesitant to use Rails. People who want to build
large-scale, heavy-traffic websites are reluctant to invest development
effort into a framework that does not fully support their goals. There
are plugins that enable various scalability features, but enabling
them tends to break something else, such as tests that use fixtures.
Developers are also frightened away from developing more in-depth
solutions because doing so would require monkey patching, which could
potentially break things and would mean another feature they have to
maintain for each new version of Rails. If Rails had this type of
support built in, the Rails ecosystem could grow larger and gain more
support from the people who avoid Rails for these reasons. Some of the
people who avoid Rails are large companies. If Rails had the support
of large companies, the community would grow. In addition, if large
companies became members of the Rails community, there would be money
behind developing for Rails, which could lead to great new features
and plugins.
  This is important to me because I use Rails. If the Rails community
grows it will inherently make developing in Rails easier for me. In
addition to that I want to open a business developing web applications
which have a very good chance of running into scalability issues and
needing features like the ones I would develop. Also, from a purely
selfish standpoint, I enjoy the feeling I get when helping others, and
it is very satisfying to know that I have affected the lives of many
people.

List a clear set of goals/milestones you'll hit during the summer. Be
specific.

  I am planning three milestones for this summer. Each milestone will
be a completed solution which could be merged into Rails edge. Every
milestone after the first will build on the previous milestones. In
addition to what is listed below, each milestone will include
requirements elicitation from the Rails community and documentation on
how to enable and set up each feature. I will also be testing various
configurations of multiple databases to ensure the features work.
  The first milestone will focus on handling multiple database
connections and handling tables in multiple databases. This will
include syntax for declaring multiple databases for development,
testing, and production in database.yml. This syntax will include a
way to name the connections and specify which connection is the
default connection. This will allow connections to be specified in
models by name and any model without a specified connection will use
the default connection. Ideally, both fixtures and migrations will use
the connection specified in the model; however, this may not be
possible, and the connection may need to be specified in them.
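
As a rough illustration of the first milestone, here is a minimal sketch in plain Ruby. It assumes a hypothetical database.yml layout in which each environment lists named connections and marks one as the default. The `default` and `connections` keys, the `ConnectionRegistry` class, and the `uses_connection` macro are all invented for this sketch; none of them is existing Rails syntax, and the final design would come out of community feedback.

```ruby
require 'yaml'

# Hypothetical database.yml layout (an assumption, not existing Rails syntax).
CONFIG_YAML = <<~YAML
  production:
    default: main
    connections:
      main:
        adapter: mysql
        database: app_production
      reporting:
        adapter: postgresql
        database: reports_production
YAML

# Holds the named connection settings for one environment.
class ConnectionRegistry
  def initialize(yaml, environment)
    env = YAML.load(yaml).fetch(environment)
    @default = env.fetch('default')
    @connections = env.fetch('connections')
  end

  # Returns the settings for a named connection, or for the default
  # connection when no name is given.
  def config_for(name = nil)
    @connections.fetch((name || @default).to_s)
  end
end

# Stand-in for a model base class; uses_connection is a made-up macro.
class SketchModel
  class << self
    attr_reader :connection_name

    def uses_connection(name)
      @connection_name = name
    end
  end
end

# No connection specified, so this model falls back to the default.
class Article < SketchModel; end

# Bound by name to the reporting connection.
class Report < SketchModel
  uses_connection :reporting
end

registry = ConnectionRegistry.new(CONFIG_YAML, 'production')
puts registry.config_for(Article.connection_name)['database']
puts registry.config_for(Report.connection_name)['database']
```

A model that never mentions a connection keeps working against the default, which matches the goal that nobody should need to know about these options until they need them.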
  The second milestone will focus on enabling a master-slave
configuration for databases. Previous work with multiple connections
will be used to enable this feature. A new syntax will be added to
database.yml to allow the declaration of a master and slaves. It will
also be possible to combine a master-slave setup with the model
binding to a connection. In this case the master-slave connections
will be named rather than just a single connection. All write actions
will be routed to the master and all read actions will be routed to a
slave. I am not sure how load balancing between slaves should work, so
I will get feedback from other developers. I imagine load balancing is
something that people may want to implement themselves for their
specific setup, so a default strategy will be provided, but it will be
easy to override to allow different implementations. Fixtures and
migrations will be
updated to work in this new setup, but not much additional work will
need to be done since writing is always handled by the master.
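
To make the read/write split and the overridable balancing concrete, here is a minimal sketch in plain Ruby. Everything in it is an assumption for illustration: the `MasterSlaveRouter` and `RoundRobinBalancer` names are invented, connections are represented by symbols, and writes are detected by a simple pattern on the SQL text, which a real implementation would likely replace with something more robust.

```ruby
# Default balancing strategy (an assumption): cycle through the slaves.
class RoundRobinBalancer
  def initialize(slaves)
    @slaves = slaves
    @index = -1
  end

  def next_slave
    @index = (@index + 1) % @slaves.size
    @slaves[@index]
  end
end

# Routes statements to the master or to a slave chosen by the balancer.
# Any object that responds to next_slave can replace the default balancer.
class MasterSlaveRouter
  WRITE_PATTERN = /\A\s*(INSERT|UPDATE|DELETE|CREATE|ALTER|DROP)\b/i

  def initialize(master, slaves, balancer = RoundRobinBalancer.new(slaves))
    @master = master
    @balancer = balancer
  end

  # Returns the connection that should run the given statement.
  def connection_for(sql)
    if WRITE_PATTERN.match?(sql)
      @master
    else
      @balancer.next_slave
    end
  end
end

router = MasterSlaveRouter.new(:master, [:slave_a, :slave_b])
puts router.connection_for("SELECT * FROM articles")
puts router.connection_for("UPDATE articles SET title = 'x'")
```

Passing a different balancer as the third argument is the override point. A real implementation would also have to decide where reads inside a transaction should go, since a slave may lag behind the master.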
  The third milestone is the most difficult. It will focus on database
sharding (shared-nothing). There are many choices for how to implement
this and I will rely heavily on the community when deciding how to
implement this feature. As in the other two milestones, there will be
some method for declaring connections to be used as shards. There will
be a way to specify models as being sharded. This will likely include
a way to declare models as global, for common static lookup tables,
such as type tables, that should be replicated between shards. There
will be a way to specify
how a model is sharded. This feature will also support some kind of
balancing for when new shards are added. Lastly, fixtures will be
updated to support this new feature.
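
One possible shape for the shard lookup is sketched below in plain Ruby. It assumes integer keys and simple modulo placement, and the `ShardMap` class and its method names are invented for illustration; how a model is actually sharded is exactly what I would want community input on.

```ruby
# Maps sharded models to a shard by key and treats "global" models
# (small static lookup tables) as present on every shard.
class ShardMap
  def initialize(shards, global_models = [])
    @shards = shards
    @global_models = global_models
  end

  # Shard to read from. Global models are replicated everywhere, so any
  # shard will do; the first one is used here.
  def shard_for(model, key)
    return @shards.first if @global_models.include?(model)

    @shards[key % @shards.size]
  end

  # Shards that must receive a write: every shard for a global model,
  # exactly one shard otherwise.
  def write_targets(model, key)
    if @global_models.include?(model)
      @shards
    else
      [shard_for(model, key)]
    end
  end
end

three = ShardMap.new([:shard_0, :shard_1, :shard_2], [:country])
four  = ShardMap.new([:shard_0, :shard_1, :shard_2, :shard_3], [:country])

puts three.shard_for(:user, 7)   # 7 % 3 == 1
puts four.shard_for(:user, 7)    # 7 % 4 == 3
```

The two lookups show why balancing matters: with modulo placement, adding a fourth shard moves user 7 from shard_1 to shard_3, so existing rows would have to be migrated or a different placement scheme chosen.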

Give a rough timeline for hitting these milestones.

4/10/09 – 5/22/09 – Community bonding
5/23/09 – 6/19/09 – Milestone 1 – Multi-database connections, model
binding to connections
6/20/09 – 7/10/09 – Milestone 2 – Master-slave
7/11/09 – 8/10/09 – Milestone 3 – Sharding
8/11/09 – 8/17/09 – Code cleanup, finalize documentation

How will you measure progress? How will you handle falling behind?

  At the beginning of each milestone I will determine a prioritized
list of things that need to be added and things that would be nice to
have. From this list I will scope out a projected schedule of when I
need to complete each item. From this schedule I will be able to gauge
my progress. If I start to fall behind, I will start cutting the
features that are not required. If I fall drastically behind, I will
still complete all the needed features for that milestone and push
back the date of the following milestone. It is quite possible that I
will not complete the third milestone even if I complete the other two
on time. In that case I hope to at least have a strong basis for
myself or someone else to continue and finish after Google Summer of
Code is completed.

What are the "unknowns" in this project for you? What kind of pitfalls
could you run into?

  I have not worked with Rails internals before. However, I know how to
program in Ruby and am very familiar with the design patterns and
practices that Rails is built upon. I have not worked with load
balancing and do not know how I might implement such a feature for the
slaves in the master-slave configuration. I think that my sponsor
would be able to assist me with this, though, and I can likely find
adequate information about it on the internet and in books. Some of
the pitfalls I could fall into would be trying to do too much, falling
behind on my project, and missing important requirements. The first
two risks can be mitigated by ranking features by importance and
planning a schedule, as I discussed in the previous section. The last
can be mitigated by rigorous testing, to ensure each feature works
completely and as expected, and through regular communication with my
sponsor and the Rails community.

Hi Allen,

First of all, forgive me my ignorance about databases and scalability.

The main question I have is: should handling multiple connections really
be in rails/ActiveRecord?

I imagine most of the work (routing queries, maintaining connections,
proxying the results) could be done on the driver/adapter side and the
current ActiveRecord implementation could be left as it is.

By driver/adapter I mean either an ActiveRecord adapter or an external
native library with an AR adapter interface.

If at all, hints about which table a query is for could be passed to the
driver, if parsing the queries on the driver side is too big of a
problem. The whole configuration IMHO should be on the binary driver
side anyway to make use of database-specific support for scalability
related features on a case-by-case basis.

It kind of seems like pulling the configuration of vendor specific
options into rails, when this can all be handled by a special driver.
One that might have its own yaml file for a richer set of configuration
options that are overhead, as you mentioned.

If an external driver/adapter is created, this can be used in any DB
related project, relational also, and not just rails projects. And not
just one DB vendor at a single time. This could also allow custom
fine-tuning by anyone without patching rails.

Since this is only relevant for production environments, this will
require special database setups that rails wouldn't really be able to
automate to begin with.

My guess is putting this into rails might break a few things very
quickly, e.g:
- migration
- associations
- synchronizing data at ORM level

What I do see as useful to change in Rails for the problems you mentioned:
- passing the model table along with the query to the
"aggregating/routing" driver, so you don't have to extract it from the
query.
- creating an adapter for this driver that will additionally support
table->connection options and pass them to the driver

The driver could just be an adapter that makes use of other ActiveRecord
adapters - which will keep the fun of writing everything in Ruby.
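
A minimal sketch of that idea in plain Ruby, assuming the table name is handed over together with the query. `FakeAdapter` stands in for a real database adapter and `RoutingAdapter` is a made-up name; neither is part of ActiveRecord.

```ruby
# Stand-in for a real database adapter; it only records what it is asked.
class FakeAdapter
  attr_reader :name, :log

  def initialize(name)
    @name = name
    @log = []
  end

  def execute(sql)
    @log << sql
    "#{name}: #{sql}"
  end
end

# Adapter-shaped proxy that forwards each query to the adapter mapped to
# the table it was given, so it never has to parse the SQL itself.
class RoutingAdapter
  def initialize(adapters_by_table, default)
    @adapters_by_table = adapters_by_table
    @default = default
  end

  # table is the hint passed along with the query by the caller.
  def execute(sql, table)
    @adapters_by_table.fetch(table, @default).execute(sql)
  end
end

main = FakeAdapter.new('main')
logs = FakeAdapter.new('logs')
router = RoutingAdapter.new({ 'events' => logs }, main)

puts router.execute('SELECT * FROM events', 'events')
puts router.execute('SELECT * FROM users', 'users')
```

Tables without an entry in the table->connection map fall through to the default adapter, so an unconfigured application behaves as it does today.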

But then again, I might be completely missing the point.

It might even be funny to be able to do:

./script/plugin install db-scalability

;)

Cheers

Allen wrote:

Sorry for the late response. I posted one earlier, but it looks like
it got moderated and I'm not sure why. Anyway, thanks for the
suggestions. It is a big concern for me that I put this into the right
place in the rails framework, whether it be directly integrated or
handled as an add-on. I would love to hear other people's opinions
on this. I think there are pros and cons to both approaches.

One of the biggest pros I can think of for it being integrated, though,
is that people will be more likely to use it, because it won't have to
be maintained separately from rails. That means if I want to upgrade my
rails version, I won't be stuck with the version I am on until
somebody updates the scalability aspects to support my setup. I think
this one thing has a high appeal to companies, especially those who
are apprehensive about using rails because of its lack of scalability
when communicating with distributed databases.

Of course all of your points are valid, and I don't see myself fully
making a decision until I've talked to my sponsor and gotten more
feedback from the rest of the rails community. You and I are 2
developers out of dozens and I could implement a solution that fits us
(not to say you were suggesting that), but completely miss the
requirements of the vast majority of the rails community.

I do have a couple of additional comments on some of the things you
mentioned, which I'll state next to your comments below. Forgive me if
I sound like I'm attacking; I don't mean to. However, I would like to
encourage you to attack my ideas. That way only the strongest will
survive.

Hi Allen,

First of all, forgive me my ignorance about databases and scalability.

The main question I have is: should handling multiple connections really
be in rails/ActiveRecord?

I imagine most of the work (routing queries, maintaining connections,
proxying the results) could be done on the driver/adapter side and the
current ActiveRecord implementation could be left as it is.

That is very possible. Could you point me to some more information on
adapters? I think if I were to go with this approach there would still
be a lot of functionality that I could reuse from ActiveRecord, so I
think an adapter would be a better approach than a driver. In the case
that I do extend ActiveRecord, I think I may need to modify it to
allow some of the extensions that would be required. This is purely
theoretical though.

By driver/adapter I mean either an ActiveRecord adapter or an external
native library with an AR adapter interface.

If at all, hints about which table a query is for could be passed to the
driver, if parsing the queries on the driver side is too big of a
problem. The whole configuration IMHO should be on the binary driver
side anyway to make use of database-specific support for scalability
related features on a case-by-case basis.

I think the first two milestones are for the most part database
agnostic (I could be wrong). The third milestone, however, may benefit
from database-specific support, so maybe this should be externalized
into an adapter. I would argue against a binary driver approach only
because not all databases support sharding and I think rails can still
provide a solution, though I think you are right that database-specific
stuff should not be in rails.

So I guess what I am suggesting would be somewhat of a hybrid. The
first two options would be built into rails. The third option could be
a vendor specific adapter or a generic non-vendor specific adapter
(probably what I would write). In that case I could write a generic
framework for a database sharding adapter, which other people could
extend to make database specific adapters.

It kind of seems like pulling the configuration of vendor specific
options into rails, when this can all be handled by a special driver.
One that might have its own yaml file for a richer set of configuration
options that are overhead, as you mentioned.

You are right that I don't want to produce overhead, but I was talking
specifically about when an application is first created. There is
almost a 0% chance that you want to set up scalable database
interaction when you first create a rails application. That is why I
mentioned that the configuration should be invisible until it is
needed (though I may not have done a very good job of it). So it is
only when you do need to implement something scalable that the richer
set of configuration options need to be used and at that time it won't
be overhead, it will be necessary. And again, not to be pedantic, but
I don't think this is vendor specific as I stated above. Perhaps you
can explain further how you think it is?

If an external driver/adapter is created, this can be used in any DB
related project, relational also, and not just rails projects. And not
just one DB vendor at a single time. This could also allow custom
fine-tuning by anyone without patching rails.

I do think it would be cool to have it supportable by more
frameworks. However, I think that may be out of the scope of what I am
suggesting. What I am really talking about is an intermediary between
the actual adapters and rails whose sole purpose is to direct queries
to the right place. That may sound simplistic, but determining the
right place to send a query is a very complicated task. I hope that
makes sense. I explicitly don't want to do something vendor specific;
rather, I want it to work for any sort of database. I think the
drivers/adapters already available do a much better job than I could
at handling vendor-specific databases.

Since this is only relevant for production environments, this will
require special database setups that rails wouldn't really be able to
automate to begin with.

I disagree that this is only relevant to production environments. Despite
the work that I would be doing, there would be some configuration that
a developer would have to do to get such a complicated setup working.
Joins between databases, and other things that arise with such a
configuration, would be suboptimal, and the developer would need to be
able to test that functionality to ensure those types of things don't
happen.

You are right that rails wouldn't be able to automate this sort of
setup, but I would argue that you wouldn't want it to. The type of
setup you use would depend on how your data is laid out, and I think it
would be a task beyond rails' ability to take the number of databases
available and determine an appropriate setup based on that layout.

My guess is putting this into rails might break a few things very
quickly, e.g:
- migration
- associations
- synchronizing data at ORM level

I absolutely agree with this assertion, but I think this is true of
extending the functionality of almost anything. I fully plan to fix
these as part of my GSoC project, though.

What I do see as useful to change in Rails for the problems you mentioned:
- passing the model table along with the query to the
"aggregating/routing" driver, so you don't have to extract it from the
query.
- creating an adapter for this driver that will additionally support
table->connection options and pass them to the driver

I definitely would like to pass the model table. I want to keep this
as DRY as possible.

Here is where I think our ideas are disconnected, if I am
understanding what you're saying correctly. You think the driver
should handle the multiple connections and if I were to implement such
a thing, it would have to be vendor specific. However, I think it
would be much more beneficial if rails handled the multiple
connections and submitted queries to the driver specified. I think
this approach is more generic and would allow all drivers currently
available to be used and still offer scalability options.

The driver could just be an adapter that makes use of other ActiveRecord
adapters - which will keep the fun of writing everything in Ruby.

I guess I restated what you said here up above (whoops!).

But then again, I might be completely missing the point.

If you missed the point it's only because I didn't explain myself well
enough.

Hi Allen,
I read your proposal and it looks interesting. Rails scaling was a
very hot topic in the community around last year, I think. Apparently
Twitter was having problems scaling with a single database, but a
quick solution was found (I think it was a plugin of about 75 lines of
code). I don't know the details of what that guy did, but it was the
solution to Twitter's scaling issues. You should also consider other big
applications that still scale using a single database. Basecamp,
for example, has more than a million users and I think that it runs on
a single database.

http://www.37signals.com/svn/posts/1509-mr-moore-gets-to-punt-on-sharding

Taking all those things into consideration can give a better idea of
what you need to build. I hope this helps.

Hey Carlos,

Thanks for the input. I have looked into some of Twitter's scaling
issues and it seems they suggest a lot of what DHH does in that
article.
http://www.slideshare.net/Blaine/scaling-twitter

The article makes a very good point that developers shouldn't be
spending time on deploying a scalable database architecture until they
need it, which I agree with. What I do think rails can benefit from,
though, is something that makes it easy to scale when it is needed. One
thing I think it could easily do automatically is the type of
denormalization that is talked about in the Twitter presentation:
basically an auto-magic column that stores a relationship from the
has_one, has_many, and has_and_belongs_to_many side of the model
relationship. This would make partitioning data substantially easier
later on. Then you could do something like
Article.find(id, :include => 'comments')
without caring if the information is in the same database.

Here's some semi-pseudo code to illustrate what would go on behind the
scenes:

# same_database? is a placeholder check; the answer is known because
# models are bound to connections
if same_database?(Article, Comment)
    # do what rails already does
    article = Article.find(id, :include => 'comments')
else
    # automatically grabs the right connection, because it was bound in
    # the model
    article = Article.find(id)

    # auto grabs the right connection as well; uses the autogen'd
    # comment_ids column and matches against id, which is faster
    # because ids are indexed
    article.comments = Comment.find(:all,
        :conditions => ['id IN (?)', article.comment_ids])
end
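
The same two-step lookup can be tried without rails at all. In this minimal sketch two plain Ruby hashes stand in for the two databases, and `comment_ids` plays the role of the auto-generated column; the data and the `find_article_with_comments` helper are made up for illustration.

```ruby
# Two hashes stand in for two separate databases, keyed by record id.
ARTICLES_DB = {
  1 => { :title => 'Scaling Rails', :comment_ids => [10, 12] }
}
COMMENTS_DB = {
  10 => { :body => 'First' },
  11 => { :body => 'Unrelated' },
  12 => { :body => 'Second' }
}

# Loads an article and its comments with one lookup per database,
# using the denormalized comment_ids list instead of a SQL join.
def find_article_with_comments(id)
  article = ARTICLES_DB.fetch(id).dup
  # Stands in for "WHERE id IN (...)" against the comments database.
  article[:comments] = article[:comment_ids].map { |cid| COMMENTS_DB.fetch(cid) }
  article
end

article = find_article_with_comments(1)
puts article[:comments].map { |c| c[:body] }.join(', ')
```

Because the article already carries the ids of its comments, the second lookup is a primary-key fetch and never needs to know where the articles live.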

I think there are a lot of instances like this where, behind the
scenes from day one, rails can prepare your application to be scaled
without you ever being concerned about it. When you do need it, you
set up your database configuration, perhaps run a generate task to
make a special migration for the type of database setup you have and
let rails handle the rest. Of course I'm oversimplifying this a
little; this is no small task.