extract keywords from string

hi -

i have strings that i need to extract keywords from. the string might
have html tags, urls, etc. i need to extract the keywords from the
string. i imagine i'm not the first guy to have to tackle this
problem. is there a gem i can use or anyone have any ideas how to
approach this?

thanks,
dino

More detail is needed about what counts as a keyword. The simple case is a
fixed list of keywords, matched regardless of context, in a string split on
whitespace.

# %w builds an array of words, so include? tests whole words
# (a plain %{} string would also match substrings such as "the" or "en")
KEYWORDS = %w{if else then end case when do def}

str = "if true then false else true end"
str.split.find_all{|s| KEYWORDS.include?(s)}

irb(main):006:0> KEYWORDS = %w{if else then end case when do def}
=> ["if", "else", "then", "end", "case", "when", "do", "def"]
irb(main):007:0> str = "if true then false else true end"
=> "if true then false else true end"
irb(main):008:0> str.split.find_all{|s| KEYWORDS.include?(s)}
=> ["if", "then", "else", "end"]

If you need to exclude keywords inside strings, URLs, etc. the solution is
more complex.

HTH,
  Jeffrey
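
For the harder case mentioned above, one possible approach is to blank out quoted strings and URLs before looking for keywords. This is only a rough sketch, not something from the post itself: the method name `keywords_outside_literals`, the constant names, and both regexes are assumptions made for illustration. The string pattern is naive (it would treat the apostrophes in ordinary prose as quotes, and it knows nothing about heredocs).

```ruby
require 'set'

KEYWORD_SET = Set.new(%w[if else then end case when do def])

# Hypothetical patterns: a naive URL match and simple quoted literals.
URL_PATTERN    = %r{\b(?:https?|ftp)://\S+}
STRING_PATTERN = /"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'/

# Returns the keywords found outside quoted strings and URLs.
def keywords_outside_literals(text)
  # Replace literals and URLs with a space so neighbouring words stay apart.
  cleaned = text.gsub(STRING_PATTERN, ' ').gsub(URL_PATTERN, ' ')
  # Compare whole words only, so "done" is not mistaken for "do".
  cleaned.scan(/\w+/).select { |word| KEYWORD_SET.include?(word) }
end

keywords_outside_literals('if x == "then end" then puts "http://example.com/do" end')
# => ["if", "then", "end"]
```

Strings are stripped before URLs so that a URL inside a quoted string disappears with the string.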

Jeff-

thanks for the reply. i can deal with context in a different method.
in your solution, i still grab "<a>" and "test." and "&wow*&&" as
keywords. i want to send this method a string, and get an array of
letter-only words returned. if you have context ideas, i'd love to
hear those too, but the first step is just harvesting the letter-only
words from strings.

thanks,
dino
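
A minimal sketch of what such a method could look like, using only core Ruby. The name `letter_words` and the regexes are assumptions for illustration, and regex-based tag stripping is fragile on real-world HTML, so treat this as a starting point only. It removes tags, URLs, and HTML entities, trims punctuation from the ends of each token, and keeps only tokens made entirely of letters.

```ruby
# Hypothetical helper: returns the letter-only words in a string,
# after dropping HTML tags, URLs, and HTML entities.
def letter_words(text)
  text
    .gsub(/<[^>]*>/, ' ')        # drop HTML tags such as <a href="...">
    .gsub(%r{\b\w+://\S+}, ' ')  # drop URLs such as http://example.com/x
    .gsub(/&#?\w+;/, ' ')        # drop entities such as &amp; or &#39;
    .split                       # split on whitespace
    .map { |token| token.gsub(/\A[^A-Za-z]+|[^A-Za-z]+\z/, '') } # trim both ends
    .select { |token| token.match?(/\A[A-Za-z]+\z/) }            # letters only
end

letter_words('<a> test. &wow*&& hello')
# => ["test", "wow", "hello"]
```

Tokens with non-letters in the middle (for example "don't" or "x2y") are dropped entirely in this sketch. If splitting them into pieces is preferable, replacing the last three steps with `scan(/[A-Za-z]+/)` would do that instead.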

Many people are leaning toward Nokogiri (read: http://nokogiri.rubyforge.org/nokogiri/Nokogiri.html).

Agreed. With the disappearance of _why, the future of hpricot is
uncertain.