Full text search using tsearch2

Hi,

I implemented a full text search using postgres > 8.3. My performance is very good to index (using triggers/after_filter or crontab time based index update) and to search for something. With Postgres i can index only lexemes, other solutions, like sphinx based or aaf creates the index based in length. So everything bigger than 2/3 chars is indexed. I think to use a lexeme is better to avoid falses positives in a real world.

My questions are:

- Any good reason to dont use tsearch and choose a aaf/*sphinx solution ? - There is any way to make tsearch index the text by length and not by lexeme (i know, probably this question i should ask in a postgres forum)

Regards,

Victor

Victor,

As a long-time PostgreSQL advocate... I'd encourage you to look at using Sphinx and/or UltraSphinx.

Why? Configuration is way less complicated for your development and production environments. I like knowing that I have the advanced features of PostgreSQL available to me... but I'll side with working within Ruby as much as possible on our projects. Sphinx will allow you to evolve your indexing without needing to muck around the database, which should ease pains with deployments as well.

If you get to a point where Sphinx is falling over from not being to handle your load, you can evaluate using Tsearch2, but in the meantime, I'd just consider it premature optimization without much benefits int he short-term.

Having said that... I haven't worked with tsearch2 in 3+ years. :wink:

Good luck!

Robby

Hey Robby,

Thanks for your feedback,

Victor,

As a long-time PostgreSQL advocate... I'd encourage you to look at using Sphinx and/or UltraSphinx.

In fact, I used ultrasphinx, and with an average load, it wasnt uncommon to have to restart ultrasphinx daemons. Another point that, i tried to explain in my question, is the way that the data is indexed. In ultrasphinx, the document is broken in tokens and everything bigger then 2 chars is indexed. So in my opinion, it generates too much "false positives" (sorry guys, i could not find another word).. there is any way to make *sphinx index lexeme ?

Why? Configuration is way less complicated for your development and production environments. I like knowing that I have the advanced features of PostgreSQL available to me... but I'll side with working within Ruby as much as possible on our projects. Sphinx will allow you to evolve your indexing without needing to muck around the database, which should ease pains with deployments as well.

I think when i remove an extra daemon, i'm drying my system architecture, right ? I mean, if you dont really spend some hours reading (ultra|think) sphinx code, it will be always a black box in your system that you dont understand. So am I loosing something ? maybe the learn courve to tsearch and all postgres caveats are bigger then *really* learn how *sphinx based solutions works ?

Best regards,

Victor

In ultrasphinx, the document is broken in tokens and everything bigger then 2 chars is indexed. So in my opinion, it generates too much "false positives" (sorry guys, i could not find another word).. there is any way to make *sphinx index lexeme ?

Sure, you can configure min_infix_len, min_word_len and whatever in your sphinx configuration.

I mean, if you dont really spend some hours reading (ultra|think) sphinx code, it will be always a black box in your system that you dont understand.

Whow, so you read Postgres code before using it? :wink: Same goes for Ruby, Rails, and 99% of what tools we use.

Whow, so you read Postgres code before using it? :wink: Same goes for Ruby, Rails, and 99% of what tools we use.

lol.. no, i mean at least the contrib/tsearch folder i read! ok the PostgreSQL: Documentation: 8.3: Full Text Search as well, and too be honest its better documented then the sphinx plugins :slight_smile:

Thanks for the tip about ultrasphinx, but i would like to know if its possible to do opposite, to make it work with lexemes. Like for example if you look for run, you will find run, runs, ran and running. All of these words are forms of the same lexeme: RUN

I dont wanna be troll or blame the plugins, i heard a lot of good things about thinksphinx, not so good things about ultrasphinx and almost nothing about tsearch, that for me is working fine, except that some developers here comes with the questions: "if i type "te" it dont bring me my "test" post and ultrasphinx does". But well in real world users wont search for "te" and if they search for te, they will find everything like te,teahupo, tea, teta, tendency, etc, etc, etc.

Regards,

Victor

Thanks for the tip about ultrasphinx, but i would like to know if its possible to do opposite, to make it work with lexemes. Like for example if you look for run, you will find run, runs, ran and running. All of these words are forms of the same lexeme: RUN

hmm, I don't know about "ran" but if you enable star search, then ru* wil at least find run, runs, and running. This may be possible too, but I can't say for sure, since I didn't need this kind of search Maybe have a look at the docs: http://www.sphinxsearch.com/docs/current.html#searching

I dont wanna be troll or blame the plugins, i heard a lot of good things about thinksphinx, not so good things about ultrasphinx and almost nothing about tsearch, that for me is working fine, except that some developers here comes with the questions: "if i type "te" it dont bring me my "test" post and ultrasphinx does". But well in real world users wont search for "te" and if they search for te, they will find everything like te,teahupo, tea, teta, tendency, etc, etc, etc.

lo, I agree. It was hard enough to read through lots of tutorials and documents. We started with ultrasphinx at some time back and it worked good enough for our needs. For new projects I would think about changing that to thinksphinx. Especially if I expect the index to be combined from several tables. But for basic search both should do well enough.

Whow, so you read Postgres code before using it? :wink: Same goes for Ruby, Rails, and 99% of what tools we use.

lol.. no, i mean at least the contrib/tsearch folder i read! ok the PostgreSQL: Documentation: 8.3: Full Text Search as well, and too be honest its better documented then the sphinx plugins :slight_smile:

Thanks for the tip about ultrasphinx, but i would like to know if its possible to do opposite, to make it work with lexemes. Like for example if you look for run, you will find run, runs, ran and running. All of these words are forms of the same lexeme: RUN

I dont wanna be troll or blame the plugins, i heard a lot of good things about thinksphinx, not so good things about ultrasphinx and almost nothing about tsearch, that for me is working fine, except that some developers here comes with the questions: "if i type "te" it dont bring me my "test" post and ultrasphinx does". But well in real world users wont search for "te" and if they search for te, they will find everything like te,teahupo, tea, teta, tendency, etc, etc, etc.

sphinx can do stemming. as for searching for prefixes, that's optional.

Fred

thanks for the tips. As documentation thinksphinx has nice railscast about. But geocoding searches for example are not documented until now (or at least, until last week hehe)

Regards,

Victor