5x speedup in inflections

Hi gents,

I am playing around with an idea to improve the performance of singularize and pluralize for Rails 4.0. In my proof of concept I see some 5x boost, but it relies an assumption that I’d like to consult with you all. Let me explain.

As you know, inflection rules have a lhs which is a string or regexp, and a replacement string as rhs.

The current implementation collects the rules in an array, and to apply them to a particular word the array is iterated. The first pattern that matches is the one applied. In particular, the most common rule (eg append “s” to form a plural), is the last one because most specific rules come first. By default we have +30 rules for singulars and +30 for plurals.

My idea is to build a single regexp with an alternation, detect which segment matches, and apply its replacement. That is, let the regexp engine itself do the loop. Much faster.

I have this working with a quick hack that has shown there’s a potential speedup here. To be able to know which regexp is the one that matched I use named captures. For example (?<_25>(ax)is) would be the alternation corresponding to the regexp /(ax)is/, if that’s the 26th pattern. It won’t win me an elegance award, but it is a hack that works (I could workaround name clashes easily if the user happens to use _25 himself, that’s not important).

Named captures are the only way I’ve seen to be able to build the alternation and at the same time know which part matches. Because existing inflection regexps have captures.

This is the proof-of-concept: https://gist.github.com/1798985.

I believe this is correct as long as the user regexps have no backreferences, because if you have a = /(…)\1/ and b = /(.)\1/ then “xx” matches b, but does not match the union a|b because the backreference \1 in b now refers to the group in a.

OK, so this is the question: do you guys use backreferences in custom inflections? If you didn’t we could consider ruling them out for 4.0 to be able to implement this.

Hi gents,

I am playing around with an idea to improve the performance of singularize
and pluralize for Rails 4.0. In my proof of concept I see some 5x boost,
but it relies an assumption that I'd like to consult with you all. Let me
explain.

As you know, inflection rules have a lhs which is a string or regexp, and a
replacement string as rhs.

The current implementation collects the rules in an array, and to apply
them to a particular word the array is iterated. The first pattern that
matches is the one applied. In particular, the most common rule (eg append
"s" to form a plural), is the *last* one because most specific rules come
first. By default we have +30 rules for singulars and +30 for plurals.

My idea is to build a single regexp with an alternation, detect which
segment matches, and apply its replacement. That is, let the regexp engine
itself do the loop. Much faster.

I have this working with a quick hack that has shown there's a potential
speedup here. To be able to know which regexp is the one that matched I use
named captures. For example (?<_25>(ax)is) would be the alternation
corresponding to the regexp /(ax)is/, if that's the 26th pattern. It won't
win me an elegance award, but it is a hack that works (I could workaround
name clashes easily if the user happens to use _25 himself, that's not
important).

Named captures are the only way I've seen to be able to build the
alternation and at the same time know which part matches. Because existing
inflection regexps have captures.

It's possible to count the number of captures in a given regexp:

    def count_captures regexp
      Regexp.union([regexp, //]).match('').captures.length
    end
    
    p count_captures /(ax)is/ # => 1
    p count_captures /((ax)is)/ # => 2
    p count_captures /(?:(ax)is)/ # => 1

With that information, you can calculate offsets. Calculating offsets
would still result in a linear scan (since you'd have to keep a list of
the offsets, then find the one that matched), and since we can use named
captures on 1.9, I'm not sure an offset based solution would be great.

This is the proof-of-concept: https://gist.github.com/1798985.

Seems good, but you don't need a *args to Regexp.union. :slight_smile:

I believe this is correct *as long as* the user regexps have no
backreferences, because if you have a = /(..)\1/ and b = /(.)\1/ then "xx"
matches b, but does not match the union a|b because the backreference \1 in
b now refers to the group in a.

OK, so this is the question: do you guys use backreferences in custom
inflections? If you didn't we could consider ruling them out for 4.0 to be
able to implement this.

If you can rule out backreferences, it seems possible you could build a
state machine that would perform in linear time to the string length.
It seems though, with your algorithm, as the string gets longer, the
original implementation is faster:

  https://gist.github.com/1805359

Do we know what kind of data we're dealing with? Somewhere between
strings of length 10 and strings of length 100, the original code
outperforms this version. If we're only dealing with shorter strings,
maybe this newer code would be better.

One last comment, the line:

  name = md.names.detect {|n| md[n]}

The length of `names` increases as the number of rules is added to the
system. Which means in the case of a match, this algorithm will still
be O(n), where n is the number of rules defined. We can see this by
testing strings that match first and strings that match last:

  https://gist.github.com/1805536

I'm using AS 3.2.1, so "zombies" will be best case, and a bunch of x's
will be worst.

If we're willing to limit the type of expressions provided to the
inflector, it's possible to construct something that matches the string
regardless of the number of rules (entries in "inflections.rb"), and
proportional to the string length.

MEOW MEOW MEOW MEOW

<3 <3 <3 <3 <3 <3

Hi there!

It’s possible to count the number of captures in a given regexp:

def count_captures regexp

  Regexp.union([regexp, //]).match('').captures.length

end

Yeah, Active Support had that some time ago. I think it was used in routing, but it got removed sometime in the middle of writing the AS guide.

With that information, you can calculate offsets.

Which offsets do you have in mind here?

Counting groups seems a good idea, we could get rid of the ugly _25 named captures I think.

Since individual inflection rules have well-formed regexps (or strings), we can count their groups. Therefore, if we add one extra group enclosing the whole regexp we can keep track of which group index corresponds to which rule. That’d be much cleaner!

This is the proof-of-concept: https://gist.github.com/1798985.

Seems good, but you don’t need a *args to Regexp.union. :slight_smile:

Oh, yes :).

If you can rule out backreferences, it seems possible you could build a

state machine that would perform in linear time to the string length.

The situation now is:

  • Inflections can have arbitrary regexps (modulus backreferences if we disallow them). A priori, they can use any regexp feature, and are typically case-insensitive.

  • Inflections have to be matched in order.

Given that and the fast regexp machine in 1.9, alternation seems to be a better and cost-free approach because what we are talking here is implemented quickly (cost == development cost).

Wouldn’t such a state machine be a rewrite of the very regexp engine?

It seems though, with your algorithm, as the string gets longer, the

original implementation is faster:

https://gist.github.com/1805359

Do we know what kind of data we’re dealing with? Somewhere between

strings of length 10 and strings of length 100, the original code

outperforms this version. If we’re only dealing with shorter strings,

maybe this newer code would be better.

Yes, I think this approach is worthwhile because in practice I believe Rails and Rails applications singularize and pluralize English words.

name = md.names.detect {|n| md[n]}

The length of names increases as the number of rules is added to the

system. Which means in the case of a match, this algorithm will still

be O(n)

Yes, yes, but seems unavoidable an O(n) that still outperforms the original algorithm with the sizes I’ve tested (I don’t mean it does not with lager sets of inflections, only I measured the typical size which is what most people use).

where n is the number of rules defined. We can see this by

testing strings that match first and strings that match last:

https://gist.github.com/1805536

I’m using AS 3.2.1, so “zombies” will be best case, and a bunch of x’s

will be worst.

Yes.

But we are not really interested in arbitrary strings. If we give a 5x boost to the vast majority of users that apply inflection to English words of ordinary length, and some minority using long strings get some penalty, that’s fine I think.

Ah, of course, another assumption here is that regexps are in practice simple.

All the time I have in mind the actual sets of regexps and words to apply them we find in typical Rails applications.

In particular in practice I don’t expect any backtracking explosion due to quantification + alternation. Most regexps are simple and anchored.

Nah, forget that last mail about backtracking. If there’s excessive backtracking in some regexp it will be present in both approaches, the way we build the alternation does not add to it.

Hi there!

It's possible to count the number of captures in a given regexp:
>
> def count_captures regexp
> Regexp.union([regexp, //]).match('').captures.length
> end
>

Yeah, Active Support had that some time ago. I think it was used in
routing, but it got removed sometime in the middle of writing the AS guide.

With that information, you can calculate offsets.

Which offsets do you have in mind here?

Counting groups seems a good idea, we could get rid of the ugly _25 named
captures I think.

Since individual inflection rules have well-formed regexps (or strings), we
can count their groups. Therefore, if we add one extra group enclosing the
whole regexp we can keep track of which group index corresponds to which
rule. That'd be much cleaner!

> > This is the proof-of-concept: https://gist.github.com/1798985.
>
> Seems good, but you don't need a *args to Regexp.union. :slight_smile:
>

Oh, yes :).

If you can rule out backreferences, it seems possible you could build a
> state machine that would perform in linear time to the string length.
>

The situation now is:

* Inflections can have arbitrary regexps (modulus backreferences if we
disallow them). A priori, they can use any regexp feature, and are
typically case-insensitive.

* Inflections have to be matched in order.

Given that and the fast regexp machine in 1.9, alternation seems to be a
better and cost-free approach because what we are talking here is
implemented quickly (cost == development cost).

Wouldn't such a state machine be a rewrite of the very regexp engine?

Ya, but if we're going to put arbitrary restrictions on the type of
matches people can do (i.e. no backreferences), you may as well use an
engine that can execute in O(n) time (n = string length). Otherwise,
what's the point?

It seems though, with your algorithm, as the string gets longer, the
> original implementation is faster:
>
> https://gist.github.com/1805359
>
> Do we know what kind of data we're dealing with? Somewhere between
> strings of length 10 and strings of length 100, the original code
> outperforms this version. If we're only dealing with shorter strings,
> maybe this newer code would be better.
>

Yes, I think this approach is worthwhile because in practice I believe
Rails and Rails applications singularize and pluralize English words.

name = md.names.detect {|n| md[n]}
>
> The length of `names` increases as the number of rules is added to the
> system. Which means in the case of a match, this algorithm will still
> be O(n)

Yes, yes, but seems unavoidable an O(n) that still outperforms the original
algorithm with the sizes I've tested (I don't mean it does not with lager
sets of inflections, only I measured the typical size which is what most
people use).

How did you determine the length of the strings that most people use? I
think seeing this data would be useful in speeding up this code.

Ya, but if we’re going to put arbitrary restrictions on the type of

matches people can do (i.e. no backreferences), you may as well use an

engine that can execute in O(n) time (n = string length). Otherwise,

what’s the point?

I don’t claim this approach is better in complexity theoretically. It isn’t. The point is that for what I think are typical scenarios the regexp engine loops and matches so much faster than in practice this is a potential important speedup.

Also I am aiming at max backwards compat. Disallowing backreferences is not going to be a big deal I think because I doubt they are used much in inflection rules (though I asked to the ML in case there’s some use case that says we have to support them).

Of course if we restrict the regexp features to a very small set we can write a scanner ourselves. But you’d still be doing basically what the regexp engine does, and without the optimizations oniguruma may have implemented after analyzing the regexp.

Because you need to go in order and per each fragment either match the whole pattern or fail fast and try the next fragment right? That’s already what happens with alternation no?

Can you beat that and still support regexps without backreferences?

How did you determine the length of the strings that most people use? I

think seeing this data would be useful in speeding up this code.

I am guessing of course, based on my experience. Also, you do not pluralize sentences or text files no? You pluralize words, that’s the expected input statistically speaking and the one that has to guide us in implementing this I believe.

Nowadays long strings get a performance boost. That does not make sense statistically speaking, English words should be the fast ones.

Indeed, running the benchmark against /usr/share/dict/words gives an overall speedup of more than 7x:

https://gist.github.com/1806049

The bigger the sample the greater the speedup because inflections are the exception. The majority of words are inflected using the last rule, so the difference in the technique to loop is bigger.

This should happen also in real life, the exceptions are rare, in general most words will be applying the last rule.

Ugh, I started the RuleSet class as a replacement of what the inflector does today and prepending rules was actually reversing them in the script :D.

Appending does slow it down a bit, but we are still around 6x.

Interesting. Have you investigated expanding the regular expressions
and doing hash based replacement via gsub!? Since we can know the
replacements in advance, it's possible to compile a hash and use it for
the replacement. If the hash misses, we can fall back to a linear scan.

Here's a quick implementation as an example. We could probably optimize
more of the expressions:

  https://gist.github.com/1806575

Interesting!

I think that approach does not guarantee the order though. Let’s imagine two inflection rules r1, r2. Let’s take a word that matches both, and let’s suppose only r2 is precompilable. If I understand that alternative correctly the algorithm would yield the replacement for r2, rather than r1. True?

I have implemented the variant that uses regular captures rather than named captures, based on the group counter hack.

We have a similar improvement of 6x, and the code reads better I believe:

https://gist.github.com/1808549

Also, I’ve added a sanity check to the script that ensures the current inflector in AS and the one in the script give the same plurals for the entire /usr/share/dict/words.

Yeah, I think we should do this.

OK, the alternation is a red herring.

I have a patch with the new approach implemented and have compared
runtimes: no difference, there's no speedup.

The 6x speedup in the proof of concept happens because the script is
not doing this apparently innocent check over the default 9
uncountable words:

      inflections.uncountables.any? { |inflection| result =~
/\b#{inflection}\Z/i }

which is present in ActiveSupport::Inflections#apply_inflections.

In that method, if you return the inflection right way instead of
doing this check, you see the 6x.

Back in the day I implemented a patch for the inflector that gave (iirc) something like 10x improvement. It was based on the Merb inflector, and was mostly accomplished by more aggressive caching. My argument at the time was that inflections tend to be used over and over again in a given app (mostly by the router).

It required some back-compat breakage, and since the inflector isn’t a large cost even in tiny API-like requests, we dropped it.

Yehuda Katz
(ph) 718.877.1325

So, knowing the key place to improve in the current implementation,
I've rewritten that test this way:

    https://github.com/rails/rails/commit/d3071db1200e90c0533f75b967c4afb519656d00

which exploits the fact that uncountables are not regexps, but words.
It is not entirely backwards compatible because the words are not
required today to be lowercase (though in practice they normally are).

Quoting Xavier Noria <fxn@hashref.com>:

Hi gents,

I am playing around with an idea to improve the performance of singularize
and pluralize for Rails 4.0. In my proof of concept I see some 5x boost,
but it relies an assumption that I'd like to consult with you all. Let me
explain.

As you know, inflection rules have a lhs which is a string or regexp, and a
replacement string as rhs.

The current implementation collects the rules in an array, and to apply
them to a particular word the array is iterated. The first pattern that
matches is the one applied. In particular, the most common rule (eg append
"s" to form a plural), is the *last* one because most specific rules come
first. By default we have +30 rules for singulars and +30 for plurals.

My idea is to build a single regexp with an alternation, detect which
segment matches, and apply its replacement. That is, let the regexp engine
itself do the loop. Much faster.

This is interesting. I tried a similar optimization parsing RSS feeds,
replacing the list of strings in a when clause with a regex. It was 35%
faster. Then I combined all the regex with a group with a unique character in
it to determine which sub-pattern matched. No faster than the list of string
approach. Interesting. There is something I don't understand about regex
performance.

Jeffrey