I’m looking for validation of some work I’ve done for a client, and I’m open to criticism (“mock me”? ;^), pointers to similar projects, and alternatives.
When I looked around in about September 2007 for a good, scalable search solution for Ruby on Rails, I found the choices lacking. Firstly, none of the solutions seemed to offer a way to keep the reverse (inverted) indices in memory, spread across as many machines as I might like to store them on. Secondly, many of the solutions seemed too general-purpose and heavyweight for my client’s needs (which are basically to search for items from the db, based on tags). The first concern was the deciding one: without addressing it, I felt that anything I implemented would not scale to the customer’s needs and aspirations, and that for such an investment, virtually unlimited scale would be mandatory.
Therefore I looked at memcached - well-proven for caching on many large-scale sites, but to my knowledge not used in search. Note that with memcached each client calculates which server to use from the key itself, so no central (scale-limiting) controller is required. Having chosen memcached, I next tried various memcached connectors for RoR. At the time (Oct 2007 or so) I found them slow and buggy; it took only a couple of incidents of totally corrupting the entire cache to turn me away from a Ruby approach to using memcached. Meanwhile, I knew from prior experience that the Python client for memcached was both fast and reliable; it was routinely 3x faster in the tests I ran. Python also seems to be quite fast at set operations.
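To make the "no central controller" point concrete, here is a minimal sketch of the idea, not the code I actually run. It assumes the simplest possible scheme (hash the key, take it modulo the number of servers), which is not necessarily what any particular memcached client does. The server names, the `tag:` key prefix, and the functions `server_for`, `index_item` and `search` are all made up for illustration, and plain dicts stand in for the memcached servers.

```python
import hashlib

# Hypothetical server list; names and ports are placeholders.
SERVERS = ["cache-a:11211", "cache-b:11211", "cache-c:11211"]


def server_for(key, servers=SERVERS):
    """Pick a server from the key alone, so every client agrees
    on the location without asking a coordinator."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return servers[int(digest, 16) % len(servers)]


# Plain dicts stand in for the memcached servers in this sketch.
stores = {server: {} for server in SERVERS}


def index_item(item_id, tags):
    """Add item_id to the posting set of each of its tags."""
    for tag in tags:
        key = "tag:" + tag
        stores[server_for(key)].setdefault(key, set()).add(item_id)


def search(tags):
    """Return the ids of items that carry every requested tag."""
    postings = []
    for tag in tags:
        key = "tag:" + tag
        postings.append(stores[server_for(key)].get(key, set()))
    if not postings:
        return set()
    # Set intersection does the actual "search".
    return set.intersection(*postings)
```

A multi-tag search is then just one lookup per tag followed by a set intersection, which is where Python's fast set operations pay off. One known weakness of the modulo scheme is that adding or removing a server remaps most keys; consistent hashing is the usual remedy.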
Getting to the punchline, I used Python and memcached, wrapped in Twisted, to provide a RESTful web service API, which is called from RoR to get ALL of the information needed to render search results. The API has been extended to allow the Ruby code to “fire and forget” new indexing info onto a deque (used as a FIFO queue), which is processed by a loosely-coupled daemon - the overhead to Ruby is about 20ms.
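The fire-and-forget part can be sketched roughly as follows. This is an illustration under assumptions rather than the real service: the worker here is a thread in the same process, where the real consumer is a separate daemon, and the names `enqueue`, `drain_once`, `worker` and the `processed` list are hypothetical. It relies only on the documented fact that `collections.deque` appends and pops are thread-safe.

```python
import threading
import time
from collections import deque

pending = deque()   # used as a FIFO: append on the right, pop on the left
processed = []      # stands in for the real index update


def enqueue(update):
    """Called by the request handler; returns immediately."""
    pending.append(update)


def drain_once():
    """Process everything currently queued, oldest first.
    Returns the number of updates handled."""
    handled = 0
    while True:
        try:
            update = pending.popleft()
        except IndexError:   # queue is empty
            break
        processed.append(update)
        handled += 1
    return handled


def worker(stop_event, idle_sleep=0.05):
    """Background loop: drain the queue, nap briefly when it is empty."""
    while not stop_event.is_set():
        if drain_once() == 0:
            time.sleep(idle_sleep)
```

The caller only pays for the `append`, so the cost on the Ruby side is essentially the HTTP round trip; the indexing work itself happens later in the worker.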
Prior to this approach, the client was using MyISAM full-text search. Searches took about 10s for less common search terms (5000 uses), and 20+s for more common search terms (100k+ uses).
With the web service, the search results are routinely returned in 1-2 seconds, and the web service itself returns results to RoR within 100-200ms. Indexing is a challenge - the rank score needs to be updated upon each viewing, but I’ve now gotten that to be almost real-time (5 minutes max). In addition, I can re-index the entire database of 1M+ items in about 8 hours. The index is backed up nightly in case of a memcached server failure (we’re using 3 servers). Besides search, the web service is also used for relatedness and for something like bookmarks.
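One way to get near-real-time rank updates without touching the index on every view is to buffer the views and fold them in periodically. The sketch below assumes that batching scheme and a rank score that is simply a view count; both are simplifications for illustration, and `record_view`, `flush_views` and `ranked` are hypothetical names.

```python
from collections import Counter

rank = {}                   # item_id -> rank score (would live in memcached)
pending_views = Counter()   # views seen since the last flush


def record_view(item_id):
    """Cheap bookkeeping done at view time."""
    pending_views[item_id] += 1


def flush_views():
    """Fold buffered views into the rank scores; meant to run on a
    timer (for example every few minutes). Returns the number of
    items whose score changed."""
    flushed = len(pending_views)
    for item_id, views in pending_views.items():
        rank[item_id] = rank.get(item_id, 0) + views
    pending_views.clear()
    return flushed


def ranked(item_ids):
    """Order matching items by descending rank score, ties broken by id."""
    return sorted(item_ids, key=lambda i: (-rank.get(i, 0), i))
```

With this layout the staleness of the rank score is bounded by the flush interval, which is the trade-off behind the "5 minutes max" figure.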
So, is there anything out there that can touch these results and provide for virtually unlimited scale (no central controller)?
Thanks in advance,
PS: Because of leaks in RMagick and its inferior performance compared to the Python Imaging Library (PIL), I’m also considering a similar approach for generating many different sizes of fairly large (10MB) images. A similar fire-and-forget web service approach could be used to minimize the impact on the RoR side. Early tests show a 10x speed improvement (even without the fire-and-forget part). Any thoughts there?