extract keywords from string

hi -

i have strings that i need to extract keywords from. the strings might contain html tags, urls, etc. i imagine i'm not the first guy to have to tackle this problem. is there a gem i can use, or does anyone have ideas on how to approach this?

thanks, dino

More detail is needed about what counts as a keyword. The simple case is matching keywords regardless of context, with words separated by whitespace.

KEYWORDS = %w{if else then end case when do def}

str = "if true then false else true end"
str.split.find_all{|s| KEYWORDS.include?(s)}

irb(main):006:0> KEYWORDS = %w{if else then end case when do def}
=> ["if", "else", "then", "end", "case", "when", "do", "def"]
irb(main):007:0> str = "if true then false else true end"
=> "if true then false else true end"
irb(main):008:0> str.split.find_all{|s| KEYWORDS.include?(s)}
=> ["if", "then", "else", "end"]

If you need to exclude keywords inside strings, URLs, etc., the solution is more complex.

HTH, Jeffrey
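
To give a rough idea of what "more complex" could look like, here is a minimal sketch that blanks out double-quoted string literals and URLs before matching. The method name keywords_outside_context is made up for illustration, and the two regexes are assumptions about what "strings" and "URLs" look like in the input (double quotes only, http/https only); real input may need a proper tokenizer.

```ruby
require 'set'

# Same keyword list as in the simple case, kept in a Set for exact-match lookup.
KEYWORDS = Set.new(%w{if else then end case when do def})

# Hypothetical helper: returns keywords that appear outside
# double-quoted string literals and http/https URLs.
def keywords_outside_context(str)
  cleaned = str.gsub(/"(?:\\.|[^"\\])*"/, ' ')   # blank out "..." literals (assumed syntax)
  cleaned = cleaned.gsub(%r{https?://\S+}, ' ')  # blank out URLs (assumed http/https only)
  cleaned.scan(/\w+/).select { |word| KEYWORDS.include?(word) }
end

p keywords_outside_context('if x then puts "do not end" else visit http://example.com/when end')
# => ["if", "then", "else", "end"]
```

Quoted literals are removed before URLs so that a URL inside a string cannot swallow the closing quote.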

Jeff-

thanks for the reply. i can deal with context in a different method. with your solution, though, i still grab "<a>" and "test." and "&wow*&&" as keywords. i want to send this method a string and get back an array of letter-only words. if you have ideas about context, i'd love to hear those too, but the first step is just harvesting letter-only words from strings.

thanks, dino
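
One possible way to harvest letter-only words with plain Ruby is sketched below. The method name letter_words is hypothetical, and the sketch assumes that punctuation should be trimmed from a token ("test." becomes "test") rather than the whole token being thrown away, and that "letters" means ASCII A-Z. The regex-based tag stripping is only a rough approximation and will not cope with every kind of malformed HTML.

```ruby
# Hypothetical helper: returns the letter-only words in a string,
# ignoring HTML tags, URLs, and HTML entities.
def letter_words(str)
  text = str.gsub(/<[^>]*>/, ' ')          # strip tags such as <a href="...">
  text = text.gsub(%r{https?://\S+}, ' ')  # strip URLs left in the text
  text = text.gsub(/&#?\w+;/, ' ')         # strip entities such as &amp; or &#39;
  text.scan(/[A-Za-z]+/)                   # keep runs of ASCII letters only
end

p letter_words('<a href="http://example.com">test.</a> &wow*&& see http://example.com/page now')
# => ["test", "wow", "see", "now"]
```

Swapping /[A-Za-z]+/ for /\p{L}+/ would also accept non-ASCII letters. The result can then be passed through a keyword filter like the one shown earlier in the thread.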

Many people are leaning toward Nokogiri (read: http://nokogiri.rubyforge.org/nokogiri/Nokogiri.html).

Agreed. With the disappearance of _why, the future of Hpricot is uncertain.