Hello all,
I was hoping to get your suggestions and comments on my GSoC proposal. Let me know what you think!
Abstract
Ruby on Rails currently lacks the ability to scale its communication with the database. Multiple instances of a Rails application can be run and queries can be optimized, but after a certain point these alone are not enough. Eventually the limitation of a single database becomes too large a bottleneck to ignore. There are people who realize this and avoid Rails because they would much rather select a tool that handles the problem for them. Ruby on Rails needs a built-in scalability solution because all serious web applications grow up, grow larger, and start to bring in more traffic. If Rails does not provide one, people will go elsewhere, but if it does, it can draw in people who previously would never have considered it.
Tell us a little about yourself.
My name is Allen and I will be graduating from Rochester Institute of Technology this May with a Bachelor of Science in Software Engineering, minors in Computer Science and Psychology, and a concentration in Business. In the fall I will be starting my graduate degree in Computer Science, studying database design and computer learning. After graduate school I want to start my own business, which will develop web applications and most likely use Ruby on Rails.
I first started programming in 10th grade with Visual Basic, HTML, and JavaScript in a class called Computer Math. This is when I fell in love with programming, and I started to develop some of my own side projects in class. I competed to join the computer programming team at my school, and every year that we competed against other schools we placed in the top three. In 11th grade I learned C++ in AP Computer Science. In 12th grade I learned Java in AP Computer Science, which had just changed to Java that year. In that same year I also took IT Programming, where I learned ASP and PHP at the same time and had to develop systems that did the same thing in both languages.
Throughout my college career I have worked with many languages, including the .NET languages, Python, Perl, PHP, Java, C++, XML, XML Schema, XSLT, JavaScript, and many others. As part of my major we are taught some of the most important concepts of software design, such as design patterns, verification and validation for testing, architecture design, designing distributed systems, and designing information systems.
I have worked for three companies thus far in the technology industry. The first company was Measurement Specialties Inc., where I wrote a Visual Basic application to interface with a new type of gas pump system that measured gas flow. The application would control the flow of gas through the system, simulate variances in pressure, and take measurements, which were stored in a database so the data could be evaluated. The second company I worked for was Riverside Regional Medical Center, in the financial department. There I wrote programs that translated data from a multidimensional database to a relational database. I also wrote programs that evaluated the data in the database to validate that statements were balanced. The third company, where I currently work part time, is Rochester Software Associates. There I work primarily with a Java-based web server that enables print flow management for large companies. Scalability is a major concern at this company because our application is used by schools and companies that vary in size from about 100 users to 10,000.
What will your availability be to work on this project?
I will be treating this project like a full-time job. At least 40 hours a week will be spent on this project. I will be taking a class over the summer, which will account for 4 hours a week plus homework. However, that time will be in addition to the 40 hours spent on the project.
Why do you use Rails? How would you like to see it improve?
I use Rails because of its simplicity. I also like the design patterns it is built on, which not only allow but encourage good design during development. Lastly, I use it because it has a strong community that backs it.
There are a few places where I think Rails could improve. One area is JavaScript, where it would be nice if the helpers were unobtrusive. I would also like to see some support for action-specific, controller-specific, and application-specific inclusion of resources like JavaScript and CSS. The last area I would like to see improve is Rails' interaction with databases, which is what I am proposing for my Google Summer of Code project.
I would like Rails to support a scalable system for interaction with data out of the box. There are a few ways of doing this. One of the simplest is to separate tables with little or no relationship into separate databases. Another is to create a master-slave setup where all write actions are directed to a single master and all read actions are directed to one of the slaves. A third option is to have replicated databases for each instance of a Rails application, which is already possible in Rails. A fourth, probably the most difficult but arguably the most scalable, is to partition databases horizontally in a shared-nothing approach (sharding). Each of these solutions has benefits and limitations, and each is applicable to different problems.
Though I believe Rails should support these solutions for scalability, they should not be on by default. Scalability should be added as needed; otherwise there would be a lot of unnecessary overhead in setting up an application. If someone develops in Rails and their site never experiences scalability issues, they should never need to know about the various scalability options.
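To make the master-slave and sharding options concrete, here is a minimal plain-Ruby sketch. Nothing in it is Rails API: ReadWriteRouter, connection_for, shard_for, and the connection names are all made up for illustration, and round-robin and hash-modulo are just the simplest strategies to show, not a decision about how the feature would work.

```ruby
require "zlib"

# Hypothetical sketch, not Rails API: sends writes to the master and
# spreads reads across the slaves in round-robin order.
class ReadWriteRouter
  def initialize(master, slaves)
    @master = master
    @slaves = slaves
    @next_slave = 0
  end

  # operation is :read or :write; returns the connection name to use.
  def connection_for(operation)
    # Writes always go to the master; so do reads when no slave exists.
    return @master if operation == :write || @slaves.empty?

    slave = @slaves[@next_slave % @slaves.size]
    @next_slave += 1
    slave
  end
end

# Hypothetical sketch of sharding: a stable checksum of the shard key
# picks one shard, so the same key always maps to the same database.
def shard_for(key, shards)
  shards[Zlib.crc32(key.to_s) % shards.size]
end

router = ReadWriteRouter.new("master", ["slave_1", "slave_2"])
puts router.connection_for(:write) # master
puts router.connection_for(:read)  # slave_1
puts router.connection_for(:read)  # slave_2
puts shard_for(42, ["shard_a", "shard_b", "shard_c"])
```

One limitation of the modulo approach is that adding a shard changes where most keys map, which is why sharding needs some form of rebalancing when new shards are added.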
Why is this important to the Rails community at large? Why is this important to you?
Ruby on Rails has a strong community. However, because there is no support for scalability when interacting with data, many people are hesitant to use Rails. People who want to build large-scale, high-traffic websites are reluctant to invest development effort in a framework that does not fully support their goals. There are plugins that enable various scalability features, but enabling them tends to break something else, such as tests that use fixtures. Developers are also frightened away from building more in-depth solutions because doing so would require monkey patching, which could break things and would add another feature they have to maintain for each new version of Rails. If Rails had this type of support built in, the Rails ecosystem could grow larger and gain support from the people who avoid Rails for these reasons. Some of the people who avoid Rails are large companies. If Rails had the support of large companies, the community would grow. In addition, if large companies became members of the Rails community, there would be money behind developing for Rails, which could lead to great new features and plugins.
This is important to me because I use Rails. If the Rails community grows, developing in Rails will become easier for me. In addition, I want to open a business developing web applications, which have a very good chance of running into scalability issues and needing features like the ones I would develop. Also, from a purely selfish perspective, I enjoy the feeling I get when helping others, and it is very satisfying to know that you have affected the lives of many people.
List a clear set of goals/milestones you'll hit during the summer. Be specific.
I am planning three milestones for this summer. Each milestone will be a completed solution that could be merged into Rails edge. Every milestone after the first will build on the previous ones. In addition to what is listed below, each milestone will include requirements elicitation from the Rails community and documentation on how to enable and set up each feature. I will also be testing various configurations of multiple databases to ensure the features work.
The first milestone will focus on handling multiple database connections and tables spread across multiple databases. This will include syntax for declaring multiple databases for development, testing, and production in database.yml. The syntax will include a way to name the connections and to specify which connection is the default. Models will then be able to specify a connection by name, and any model without a specified connection will use the default connection. Ideally, both fixtures and migrations will use the connection specified in the model; however, this may not be possible, and the connection may need to be specified in them.
The second milestone will focus on enabling a master-slave configuration for databases. The multiple-connection work from the first milestone will be used to enable this feature. A new syntax will be added to database.yml to allow the declaration of a master and slaves. It will also be possible to combine a master-slave setup with binding a model to a connection. In this case a master-slave group of connections will be named, rather than just a single connection. All write actions will be routed to the master and all read actions will be routed to a slave. I am not sure how load balancing between slaves will work, so I will get feedback from other developers on how it should work. I imagine load balancing is something people may want to implement themselves for their specific setup, so a default configuration will be selected, but it will be easy to override to allow different implementations. Fixtures and migrations will be updated to work in this new setup, but not much additional work will be needed since writing is always handled by the master.
The third milestone is the most difficult. It will focus on database sharding (shared-nothing). There are many choices for how to implement this, and I will rely heavily on the community when deciding among them. As in the other two milestones, there will be some method for declaring connections to be used as shards. There will be a way to specify models as being sharded. This will likely include declaring models as global, for common static lookup tables for types that should be replicated between shards. There will be a way to specify how a model is sharded. This feature will also support some kind of balancing for when new shards are added. Lastly, fixtures will be updated to support this new feature.
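As a rough illustration of the first two milestones, the sketch below parses a possible database.yml layout and resolves which connection a model would use. Everything here is an assumption for discussion: the "default", "master", and "slaves" keys are not existing Rails syntax, the helper names are invented, and the hosts and database names are placeholders. The final syntax would come out of community feedback.

```ruby
require "yaml"

# Hypothetical database.yml layout with named connections. The keys
# "default", "master", and "slaves" are illustrative, not Rails syntax.
CONFIG = YAML.load(<<~YML)
  production:
    main:
      adapter: mysql
      database: app_production
      default: true
    reporting:
      master:
        adapter: mysql
        host: reports-master.example.com
      slaves:
        - adapter: mysql
          host: reports-slave-1.example.com
        - adapter: mysql
          host: reports-slave-2.example.com
YML

# Returns the name of the connection flagged as default for an environment.
def default_connection(config, env)
  name, _settings = config.fetch(env).find { |_name, settings| settings["default"] }
  name
end

# A model that names no connection falls back to the default one.
def connection_name_for(model_bindings, model, config, env)
  model_bindings.fetch(model, default_connection(config, env))
end

# Picks the settings for a read or a write on a named connection. The
# balancer is any callable, so a site-specific strategy can replace the
# default random choice among slaves.
def settings_for(config, env, name, operation, balancer: ->(slaves) { slaves.sample })
  settings = config.fetch(env).fetch(name)
  return settings unless settings.key?("master")

  operation == :write ? settings["master"] : balancer.call(settings["slaves"])
end

bindings = { "Report" => "reporting" } # stands in for a declaration inside a model
puts connection_name_for(bindings, "User", CONFIG, "production")     # main
puts connection_name_for(bindings, "Report", CONFIG, "production")   # reporting
puts settings_for(CONFIG, "production", "reporting", :write)["host"] # reports-master.example.com
```

Passing the balancer as a callable is one way to keep a default while leaving it easy to override, for example `balancer: ->(slaves) { slaves.first }`.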
Give a rough timeline for hitting these milestones.
4/10/09 – 5/22/09 – Community bonding
5/23/09 – 6/19/09 – Milestone 1 – Multi-database connections, model binding to connections
6/20/09 – 7/10/09 – Milestone 2 – Master-slave
7/11/09 – 8/10/09 – Milestone 3 – Sharding
8/11/09 – 8/17/09 – Code cleanup, finalize documentation
How will you measure progress? How will you handle falling behind?
At the beginning of each milestone I will put together a prioritized list of things that need to be added and things that would be nice to have. From this list I will lay out a projected schedule of when I need to complete each item, and I will gauge my progress against that schedule. If I start to fall behind, I will start cutting the features that are not required. If I fall drastically behind, I will still complete all the needed features for that milestone and push back the date of the following milestone. I think it is quite possible that I will not complete the third milestone even if I complete the other two on time. In that case I hope to at least leave a strong basis for myself or someone else to continue from and finish after Google Summer of Code is over.
What are the "unknowns" in this project for you? What kind of pitfalls could you run into?
I have not worked with Rails internals before. However, I know how to program in Ruby and am very familiar with the design patterns and practices that Rails is built upon. I have not worked with load balancing and do not know how I might implement such a feature for the slaves in the master-slave configuration. I think my sponsor would be able to assist me with this, though, and I can likely find adequate information about it on the internet and in books. Some of the pitfalls I could run into are trying to do too much, falling behind on my project, and missing important requirements. The first two risks can be mitigated by ranking features by importance and planning a schedule, as I discussed in the previous section. The last can be mitigated by rigorous testing, to ensure each feature works completely and as expected, and through regular communication with my sponsor and the Rails community.