mangling search terms

I've been working on a pet project and have just started implementing
full-text searching with acts_as_xapian. It's working pretty well but I'm
having trouble getting some of the bells and whistles to work.

First is the spelling correction. If it feels that there are incorrectly
spelt words, it provides an array of the correct spellings, but without
reference to which words it is correcting.

So if I enter the search term "the cat adn the dog", it will give an array
["and"] which is useless in a gsub because it can't tell what it should be
replacing. I want to be able to say "did you mean 'the cat *and* the dog'?"
but I can't work out how to manipulate the string.
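
One possible approach, sketched in plain Ruby: since the corrections come
back without positions, guess the position by replacing the query word
that is closest (by edit distance) to each correction. The helper names
edit_distance and did_you_mean are made up for illustration and are not
part of acts_as_xapian, and the sketch assumes each correction is a single
word.

```ruby
# Classic dynamic-programming Levenshtein distance between two strings.
def edit_distance(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      # deletion, insertion, substitution
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Hypothetical helper: for each suggested correction, replace the query
# word that is nearest to it, ignoring words that already match.
def did_you_mean(query, corrections)
  words = query.split(' ')
  corrections.each do |fix|
    candidates = words.each_index.reject { |i| words[i].downcase == fix.downcase }
    idx = candidates.min_by { |i| edit_distance(words[i].downcase, fix.downcase) }
    words[idx] = fix if idx
  end
  words.join(' ')
end

puts did_you_mean('the cat adn the dog', ['and'])
```

This is a heuristic: with short words or several misspellings it can pick
the wrong word, so it is worth checking against real queries before
relying on it.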

The second puzzle is regarding highlighting the search terms. When you
follow a link to one of the results, it appends the search term to the query
and uses TextHelper#highlight to mark those words. The problem is that it
is expecting an array, not a string. So I split the string by spaces, but
what about parts of the query that were enclosed in quotes?

I have found it impossible to mangle a complex query such as:

"null pointer" undefined "static char array"

so that it can be passed as a query parameter and then decoded again for
the highlighter. I've tried all sorts of regexps, splits and joins, but it
has just given me a headache.
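
A minimal sketch of one way to do the round trip, assuming the query only
uses double quotes for phrases: escape the whole string for the URL, then
split it after decoding with a single scan that keeps quoted runs
together. search_phrases is a hypothetical helper name.

```ruby
require 'cgi'

# Split a search string into phrases: text inside double quotes stays
# together, everything else is split on whitespace.
def search_phrases(query)
  query.scan(/"([^"]+)"|(\S+)/).flatten.compact
end

query   = '"null pointer" undefined "static char array"'
encoded = CGI.escape(query)      # safe to append as a query parameter
decoded = CGI.unescape(encoded)  # Rails normally decodes params for you
phrases = search_phrases(decoded)

p phrases
```

The resulting array can be handed to the highlight helper, which accepts
an array of phrases. An unbalanced quote is not treated specially here: it
simply ends up as part of an ordinary word.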

I know people have done this before so I'm hoping someone can give me some
pointers. Let me know if I can provide any more information to explain
myself better.

Many thanks

Matt

OK, I think I spoke too soon. Even after rebuilding and updating the indexes
several times, full-text searching doesn't manage to search the entire body
of text, only the first few lines. Investigation shows that the Xapian
Google Groups list is almost pure spam and isn't active at all.

I guess I've chosen a duff technology to use, so I'll need to switch. Can
anyone suggest the current favorite for fulltext searching?

I don't really care about the spelling correction or highlighting of search
terms (it interferes with my caching), just a simple search.

Thanks

Matt

Take a look at ferret as a starting point. If you need more "oomph" you can
move to either sphinx or solr. The "Advanced Rails Recipes" book has
examples of all 3.

Thanks, I'll take a look.

Matt

Excellent, I managed to replace xapian with ferret and have it searching
properly (it seems) within about 10 minutes.

There's still one thing I would like to do if possible but it seems I might
be out of luck.

This is for a custom made CMS with many pages. Each page uses fragment
caching which expires when a page is edited. Bearing that in mind, how can I
implement search term highlighting? I almost got it working with xapian before
I realised that the caching would conflict with it.

I understand that it might not be possible but it would be nice. Maybe I
could just have it highlight the words on the results page, rather than on
the page itself.

I'm also looking for a way to display an excerpt of the sentence containing
the search terms, but for now I'll just show the page title.
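
Highlighting only on the results page sidesteps the fragment cache, since
the cached page itself is never touched. A rough sketch in plain Ruby of
pulling a short excerpt around the first hit and marking the terms in it
follows. The names snippet and mark are hypothetical, and the text is
assumed to be plain text that is already safe to output. Rails also ships
excerpt and highlight view helpers in TextHelper that cover similar ground.

```ruby
# Build one case-insensitive pattern matching any of the phrases.
def phrase_pattern(phrases)
  Regexp.new(phrases.map { |p| Regexp.escape(p) }.join('|'), Regexp::IGNORECASE)
end

# Return a window of text around the first matching phrase,
# or nil when nothing matches.
def snippet(text, phrases, radius = 40)
  return nil if phrases.empty?
  match = phrase_pattern(phrases).match(text)
  return nil unless match
  start  = [match.begin(0) - radius, 0].max
  finish = [match.end(0) + radius, text.length].min
  prefix = start > 0 ? '...' : ''
  suffix = finish < text.length ? '...' : ''
  "#{prefix}#{text[start...finish]}#{suffix}"
end

# Wrap every occurrence of any phrase in <strong> tags.
def mark(text, phrases)
  return text if phrases.empty?
  text.gsub(phrase_pattern(phrases)) { |hit| "<strong>#{hit}</strong>" }
end

body = 'A null pointer dereference crashes the program'
puts mark(snippet(body, ['null pointer'], 5), ['null pointer'])
```

The window is cut by character count, so it can start or end in the middle
of a word. Trimming to the nearest space would be a small refinement.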

Any more help from you guys is greatly appreciated.

Thanks

Matt

Glad that you now have a working solution, but just for the record (and
for future searchers): although the acts_as_xapian project seems to have
been unmaintained since July, there are two other active projects building
Ruby wrappers on top of Xapian: xapit
(http://github.com/ryanb/xapit/blob/master/README.rdoc) and xapian-fu
(http://github.com/johnl/xapian-fu). We maintain a listing of current
wrappers in the Xapian wiki at http://trac.xapian.org/wiki/FAQ/RubyWrappers

I don't know why acts_as_xapian was having problems giving you
spelling corrections in a useful way. Xapian's interface returns the
suggested spell-corrected query as a whole, which seems to be what you
wanted, so it is unclear to me why acts_as_xapian wasn't passing that on.

Thanks for the reply. Unfortunately I'm now having problems with ferret (see
my other recent post) and Ruby 1.9.1, so I might end up looking at Solr and
Sunspot if I can't resolve it.

As for xapian, I'm sure there are implementations that work, but I'm not
sure if I'll try it again, though I don't really know why. As for the
spelling, it's interesting that xapian itself returns a corrected version
of the entire query. That would have worked perfectly for me.
Unfortunately, acts_as_xapian only returned the single corrected word,
which made it far less useful.

Well, thanks for taking the time to reply, but I think this thread is redundant
now until I can fix ferret or decide to move over to Solr or something else.

Thanks

Matt