RE lookahead RE problems

I'll start by confessing that this comes originally from something I worked on in Perl, and I've assumed, rightly or wrongly, that regular expressions are regular expressions are regular expressions.

See http://www.ilovejackdaniels.com/cheat-sheets/regular-expressions-cheat-sheet/

The context is that there is a whole pile of patterns that must be preceded by ... well, not words or _some_ punctuation. Call them "sort-of zero width". That is, white space, beginning of line and some opening sequences (call them '[', '(' and '{' for the sake of the example) are allowed.

I'm trying to put the RE into a 'constant' so that I don't have to keep repeating it - all the DRY stuff about changes and so forth!

I'm trying to use RE's lookahead.

This works in Perl:

       $STARTWORD = qr/^|(?<=[\s\(\[\{])/m;

There is also the corresponding end word

       $ENDWORD = qr/$|(?=[ \t\n\,\.\;\:\!\?\)])/om;

When I translate these into Ruby I get an error. It doesn't seem to like the lookbehind. The error message is

    SyntaxError undefined (?...) sequence: /^|(?<=[\s\(])/

Well, possibly. Or it may be that I'm having problems when combining it with an actual pattern.

What I've done is separate out the pattern to a constant (and tried to eliminate things that might confuse the parser)

    STARTWORD = %r{^|(?<=[\s\(])}m

And LO! The parser chokes on that. Does it choke because there isn't actually a pattern being compared? Well, maybe. If I remove the '%r{' stuff the parser doesn't choke. But it doesn't choke on

    ENDWORD = %r{$|(?=[\s,.;:!?)])}m

And I seem to be getting confused when combining these with other regular expressions because of this inconsistency.

Right now I don't know if the problem is having the REs as constants. Does this make them 'precompiled'? ENDWORD.type ==> "Regexp", so I'm presuming it does. In which case why can't I precompile STARTWORD?

So: Is it that Ruby can't handle the '?<=' lookbehind assertion ... or what? Am I completely hung up on a wrong track?

The ruby regular expression engine doesn't support look-behind.
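If the engine at hand has no look-behind, one possible workaround is to capture the preceding character instead of asserting it. This is only a minimal sketch: the names START_CAPTURE and words_after_opener are made up for illustration, and it has not been checked against 1.8 itself.

```ruby
# Hypothetical workaround: match the opener as an ordinary group
# instead of asserting it with (?<=...).
# Note that, unlike a look-behind, this group consumes the opener.
START_CAPTURE = /(^|[\s(\[{])/

def words_after_opener(text)
  # Group 1 is the opener (empty at the start of a line);
  # group 2 is the word that follows it.
  text.scan(/#{START_CAPTURE}(\w+)/).map { |_opener, word| word }
end

p words_after_opener("foo (bar) [baz] x_y")
```

Because the opener is part of the match, a substitution built on this pattern has to put group 1 back into the replacement text.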

As far as I know, look-behind assertions are not handled by Ruby 1.8.* but I think Oniguruma in 1.9 can.
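For what it's worth, here is a sketch of how the two constants might look on Ruby 1.9 or later, where the built-in engine accepts look-behind. The helper delimited_word? is a made-up name for illustration. The Perl /m flag is dropped on purpose: in Ruby, ^ and $ already match at line boundaries, and /m changes the meaning of the dot instead.

```ruby
# Assumes Ruby 1.9 or later; on 1.8 the (?<=...) below is a SyntaxError.
STARTWORD = /^|(?<=[\s(\[{])/
ENDWORD   = /$|(?=[\s,.;:!?)])/

# Interpolating a Regexp into another literal wraps its source in a
# non-capturing group, so the alternation inside each constant stays scoped.
def delimited_word?(text, word)
  !!(text =~ /#{STARTWORD}#{Regexp.escape(word)}#{ENDWORD}/)
end

p delimited_word?("see (foo) here", "foo")
p delimited_word?("seafood", "foo")
```

Keeping the boundaries in constants means the list of allowed openers and closers only has to change in one place.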

You should ask your question on the Ruby-Talk mailing list, which is a more appropriate place for this kind of question.

    -- Jean-François.

John Harrison said the following on 16/01/08 12:43 PM:

The ruby regular expression engine doesn't support look-behind.

Comparison of regular expression engines - Wikipedia

{{ExpletiveDeleted!}}

Suggestions?

Oniguruma

http://oniguruma.rubyforge.org/

This engine is the RegExp engine for Ruby 1.9 and onwards, so you only need this gem for 1.8.x.

Jason

Jason Roelofs said the following on 16/01/08 01:21 PM:

Oniguruma

http://oniguruma.rubyforge.org/

This engine is the RegExp engine for Ruby 1.9 and onwards, so you only need this gem for 1.8.x.

Roll on 1.9 then, because I get pages and pages of error messages when I try installing this gem, starting with

    oregexp.c:2:23: error: oniguruma.h: No such file or directory

Now that can't be because I don't have the Ruby sources installed, can it?

The Oniguruma gem is just a wrapper around the actual library. I haven’t installed this myself, though I assumed it would come with the needed code. You just need to install Oniguruma itself, then get the gem.

Jason

The library can be found here: [link title: "Notice of End of Service"]

I am trying to get look behind working as well. However, having got past the errors, I am now wrestling with syntax:

    ** Starting Rails with development environment...
    Exiting
    /usr/local/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:27:in `gem_original_require': ./lib/string_extensions.rb:4: undefined (?...) sequence: /[aeiou]|(?<![aeiou])y(?![aeiou])/ (SyntaxError)
    ./lib/string_extensions.rb:8: undefined (?...) sequence: /![aeiou]|(?<=[aeiou])y(?=[aeiou])/
            from /usr/local/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:27:in `require'

It seems to be complaining about the look-behind and look-ahead assertions in the following code fragment (which Oniguruma is supposed to support):

    class String
      def vowels
        scan(/[aeiou]|(?<![aeiou])y(?![aeiou])/i)
      end

      def consonants
        scan(/![aeiou]|(?<=[aeiou])y(?=[aeiou])/i)
      end
    end

According to this reference ([link title: "Notice of End of Service"] doc/RE.txt), the look-behind and look-ahead syntax that I am using appears to be correct (ref section 7. Extended groups) but apparently is not.
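Separately from the look-behind error, the `![aeiou]` in the consonants pattern above probably does not do what was intended: outside of `(?!...)` and `(?<!...)`, a `!` is just a literal character. A minimal sketch, on the assumption that "any non-vowel" was the intent (the variable names are made up for illustration):

```ruby
# "!" matches itself here; it does not negate the character class.
literal_bang = /![aeiou]/
# Assumption: a negated class is what the fragment was after.
negated_set  = /[^aeiou]/

p "wh!at".scan(literal_bang)  # => ["!a"]
p "what".scan(negated_set)    # => ["w", "h", "t"]
```

Note that a negated class such as `[^aeiou]` also matches spaces, digits and punctuation, not only consonants.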

<stumped/>

Thanks for all the help everyone. The problem was solved with the help from pullmonkey on Rails Forum! Here is the solution:

Objective:

1. Extract vowels and consonants from a string
2. Handle the conditional treatment of 'y' as a vowel under the following circumstances:
   - y is a vowel if it is surrounded by consonants
   - y is a consonant if it is adjacent to a vowel

Here is the code that works:

    def vowels(name_str)
      # "y" counts as a vowel only when neither neighbour is a vowel
      reg = Oniguruma::ORegexp.new('[aeiou]|(?<![aeiou])y(?![aeiou])')
      reg.match_all(name_str).to_s.scan(/./)
    end

    def consonants(name_str)
      # "y" counts as a consonant when preceded OR followed by a vowel
      reg = Oniguruma::ORegexp.new('[bcdfghjklmnpqrstvwx]|(?<=[aeiou])y|y(?=[aeiou])')
      reg.match_all(name_str).to_s.scan(/./)
    end

(Note, the .scan(/./) can be eliminated to return an array)

The major problem was getting the code to accurately treat "y" as a consonant. The key to solving this problem was to:

1. define unconditional consonants explicitly (i.e., [bcdfghjklmnpqrstvwx]) -- not as [^aeiou], which automatically includes "y" and thus OVERRIDES any conditional treatment of "y" that follows

2. define conditional "y" regexp assertions independently, i.e., "|(?<=[aeiou])y|y(?=[aeiou])" -- not "|(?<=[aeiou])y(?=[aeiou])", which only matches "y" preceded AND followed by a vowel, not preceded OR followed by a vowel
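For anyone on Ruby 1.9 or later, the same two rules can probably be written with plain Regexp literals and no gem. This is a sketch under that assumption, not the code from the Rails Forum answer. The constant names are made up, and "z" is added to the consonant class on the guess that leaving it out of the list was an oversight.

```ruby
# Assumes Ruby 1.9+, where Regexp literals accept look-behind directly.
# "y" is a vowel only when neither neighbour is a vowel.
VOWEL = /[aeiou]|(?<![aeiou])y(?![aeiou])/i

# "y" is a consonant when preceded OR followed by a vowel.
# Assumption: "z" belongs in the explicit consonant list.
CONSONANT = /[bcdfghjklmnpqrstvwxz]|(?<=[aeiou])y|y(?=[aeiou])/i

def vowels(name_str)
  name_str.scan(VOWEL)      # returns an array of single-letter strings
end

def consonants(name_str)
  name_str.scan(CONSONANT)
end

p vowels("yellow")      # => ["e", "o"]
p consonants("yellow")  # => ["y", "l", "l", "w"]
p vowels("lynx")        # => ["y"]
```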

HTH.