Not really a Rails post but...

Dale_Cook · June 20, 2009, 12:25am

This isn't really a Rails post, but this group has given such great responses to a range of questions over the years that I thought I'd ask anyway.

I've been tasked with writing a Rails app that takes a block of text, anywhere from about 50 to 300 characters (a sentence or two), and compares it to other similarly sized blocks of text to determine how similar they are, both in content and in context. It doesn't have to be perfect, but it has to be reasonably close. I was thinking it would be good to get a numerical score reflecting how close they are (90 is really close, 20 is not very close at all), but I'm certainly open to ideas.

Anyway, the problem is I have no idea how to do this or even where to look to get started. I really doubt that there is already a Ruby library to do this (although that would rock), or a Rails plug-in (although that would rock really hard), so I'm mostly looking for ideas on what I should be reading to get a sense of how to start on this. Anything would help: theoretical ideas, technical papers, Wikipedia articles, anything.

Anyway, any suggestions are greatly appreciated. Dale

Scott · June 20, 2009, 12:57am

This might be useful:

http://engtagger.rubyforge.org/

11175 · June 20, 2009, 2:28pm

PeteSalty wrote:

This isn't really a Rails post, but this group has given such great responses to a range of questions over the years that I thought I'd ask anyway.

That's not a good reason to ask off-topic questions. [...]

Anyway, the problem is I have no idea how to do this or even where to look to get started.

A quick Google search would have readily turned up the phrase "Levenshtein distance".

Good luck. Next time, though, keep your queries on topic and do a little more research of your own before posting!

Best,

Tim_Rand · June 21, 2009, 5:43am

Hi Dale, It is a good ruby question (and rails is a ruby framework--so I think it is fair game). I don't think the Levenshtein suggestion will be that helpful. (You can try it with require 'text' once the text gem is installed; it is a gem, not part of the standard library.) The algorithm counts the number of changes needed to turn one string into a second (deletions, substitutions, and insertions). It is nice for comparing words for possible misspellings, etc. But in your case, if you want to compare content, you need an approach that focuses on word frequency and context. Here Levenshtein is not the right tool (at least, it doesn't seem so to me).
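For readers who want to see what the edit-distance idea amounts to, here is a minimal pure-Ruby sketch. The method name levenshtein is my own choice for illustration and is not taken from any library; it just counts single-character deletions, insertions, and substitutions as described above.

```ruby
# Edit distance: the minimum number of single-character deletions,
# insertions, and substitutions needed to turn string a into string b.
# (Illustrative helper; the name is an assumption, not a library API.)
def levenshtein(a, b)
  return b.length if a.empty?
  return a.length if b.empty?

  # previous[j] is the distance between the prefix of a handled so far
  # (initially empty) and the first j characters of b.
  previous = (0..b.length).to_a

  a.each_char.with_index(1) do |char_a, i|
    current = [i] # turning i characters of a into an empty string
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [
        previous[j] + 1,        # deletion
        current[j - 1] + 1,     # insertion
        previous[j - 1] + cost  # substitution (or match)
      ].min
    end
    previous = current
  end

  previous.last
end

puts levenshtein("kitten", "sitting") # => 3
```

Because it works character by character, two sentences that say the same thing with reordered or different words will still come out far apart, which is why it tends to suit spelling checks better than content comparison.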

EngTagger looks like an interesting module--could be useful.

Here is a third proposal--What about breaking the text into chunks-- one-word fragments, two-word fragments, three-word fragments and then doing a subtraction of one array of fragments from the other (fragments generated from 2nd sentence)? A perfect match would leave an empty array, while less perfect matches would leave more fragments. It might take some tinkering to get the algorithm tuned right, but how much tinkering depends on how much information you need to get from the poorer matches. The single word fragments compare for content, the double and triple word fragments would compare context. Good luck, Tim
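The fragment idea above could be sketched roughly as follows. This is only one possible reading of it: the method names (words, fragments, fragment_score), the punctuation handling, and the way the leftover fragments are turned into a 0 to 100 score are all assumptions made for illustration, and the tuning Tim mentions is left out.

```ruby
# Split text into lowercase words, ignoring punctuation.
def words(text)
  text.downcase.scan(/[a-z0-9']+/)
end

# All one-, two-, and three-word fragments of the text.
def fragments(text, sizes = [1, 2, 3])
  tokens = words(text)
  sizes.flat_map do |n|
    tokens.each_cons(n).map { |group| group.join(" ") }
  end
end

# Score from 0 to 100: the share of fragments of `a` that also occur in `b`.
# Subtracting the arrays leaves the fragments of `a` with no match in `b`,
# so a perfect match leaves nothing behind and scores 100.
def fragment_score(a, b)
  frags_a = fragments(a)
  return 0 if frags_a.empty?

  leftover = frags_a - fragments(b)
  (100.0 * (frags_a.size - leftover.size) / frags_a.size).round
end

puts fragment_score("The cat sat on the mat", "the cat sat on the mat.") # => 100
puts fragment_score("the cat sat", "the dog sat")                        # => 33
```

Note that this score is not symmetric (it measures how much of the first text is found in the second), so comparing in both directions and averaging might be worth trying.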

11175 · June 21, 2009, 6:02am

timr wrote:

Hi Dale, It is a good ruby question (and rails is a ruby framework--so I think it is fair game).

[...]

I daresay most people on this list would disagree. There are other lists and newsgroups for general Ruby questions. This is specifically a Rails list, not a general Ruby one. The fact that Rails is written in Ruby is irrelevant to the issue.

Best,

Dale_Cook · June 22, 2009, 5:32pm

Thanks to everyone for the suggestions, especially Tim for taking the time to write up what I think is a really good idea. I'm going to take a crack at this and if it works out and I can get the algorithm tuned we'll release it as a gem (or possibly a plugin).

Once again, thanks for all the suggestions, it's what makes this list great.

Dale

GS11 · June 23, 2009, 9:05am

I just wanted to chime in and say that I've gained something from Dale's post, and I welcome posts like this because I find them the most interesting and thought-provoking kind. Before reading this thread I would have had no idea what a Levenshtein distance is, but now I can file that away for future reference if I ever come across such a problem.

And so I really don't know, Marnen, how you can "daresay most people on this list would disagree." Do you know a majority of the people on this list? I'm not saying I do...in fact, I'm saying I don't, and I don't presume to speak for them. On your part, it seems like a rather bold, unsupportable statement to speak for the majority opinion of what I understand to be a large number of fairly opinionated people, as both you and I seem to be.

Just to be clear, Marnen, I'm not disagreeing with your desire to keep the list on topic, and I think you are completely entitled to your own opinion that the question does not belong on the list, but I think that attempting to speak for a majority of the list is...I'm not sure...philosophically incorrect? I would have preferred it if you had spoken for yourself the way timr did when he said: "It is a good ruby question (and rails is a ruby framework--so I think it is fair game)."

Dale, as for your problem, I'm not sure I understand exactly what you mean by context, but I get the idea that you are being asked to compare the meaning of the two sentences...or no? If that is the case, then the loose ideas I'm having are: 1) take a sentence from each and use a dictionary reference to count the nouns, verbs, and adjectives. 2) take a word from one sentence and use a thesaurus to see if it matches up with a word from the other sentence.

cheers, -Gabe
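Gabe's second idea, matching words through a thesaurus, might look something like the sketch below. The synonym groups here are a tiny hand-written stand-in invented for this example (a real thesaurus source would have to replace them), and the names SYNONYM_GROUPS, CANONICAL, canonical_words, and synonym_overlap are likewise hypothetical. The score is the share of distinct words the two texts have in common once synonyms are folded together.

```ruby
# Hypothetical, hand-written synonym groups; a real thesaurus would go here.
SYNONYM_GROUPS = [
  %w[big large huge],
  %w[quick fast rapid],
  %w[car automobile]
].freeze

# Map every word to the first word of its group, so synonyms compare equal.
CANONICAL = SYNONYM_GROUPS.each_with_object({}) do |group, map|
  group.each { |word| map[word] = group.first }
end.freeze

# Distinct lowercase words of the text, with synonyms folded together.
def canonical_words(text)
  text.downcase.scan(/[a-z']+/).map { |word| CANONICAL.fetch(word, word) }.uniq
end

# Score from 0 to 100: shared distinct words divided by all distinct words.
def synonym_overlap(a, b)
  words_a = canonical_words(a)
  words_b = canonical_words(b)
  union = words_a | words_b
  return 0 if union.empty?

  (100.0 * (words_a & words_b).size / union.size).round
end

puts synonym_overlap("a big car", "a large automobile") # => 100
puts synonym_overlap("quick car", "slow car")           # => 33
```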
