"randomly" failing CI tests (was: Re: [Rails-core] Re: [CruiseControl] RubyOnRails build 8889 failed)

Joy. That happens on our CI builds at least a few times a month. Could be anything - architecture differences that cause different test ordering, machine speed, slightly different dependency versions (use GemInstaller!), the phase of the moon...

What I usually do is keep a separate checkout of the project on the CI box itself, and run the cruise task and the individual failing test manually from the command line. If the failure is reproducible that way (it often is), then find out why. If it is not (after several runs), then disable the test under CI. If that makes you feel dirty - good! You can create a separate build with the test enabled to keep digging into the failure, without spamming people or reducing the quality of the 'real' build.
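The "several runs" check above can be scripted. What follows is only a rough sketch, not anything from the actual CI setup: the helper name `rerun`, its parameters, and the test path in the usage comment are all made up for illustration.

```ruby
# Hypothetical helper: run a command several times and count the failures.
# `runner` defaults to Kernel#system, which returns true on a zero exit
# status, false on a non-zero one, and nil if the command could not be run.
def rerun(command, runs: 5, runner: ->(cmd) { system(cmd) })
  failures = (1..runs).count { !runner.call(command) }
  { runs: runs, failures: failures, reproducible: failures > 0 }
end

# Example usage on the CI box (the test path is a placeholder):
#   result = rerun("ruby -Itest test/some_flaky_test.rb", runs: 10)
#   puts "#{result[:failures]} of #{result[:runs]} runs failed"
```

If it never fails across a reasonable number of runs, that is the case where I'd disable the test in the main build and move it to the separate one.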

Of course, this approach requires you to have access to the CI box, which presents a problem in this case since Alexey/Thoughtworks owns it.

I've held off saying this in the past, but why doesn't Rails run its CI on a dedicated box that Rails owns (and can pick the architecture for, if Red Hat-ish things cause regular problems)? This could introduce new problems (like nobody caring enough to fix regular CI box issues, as Alexey so kindly does), but it could fix others (like core committers wondering why the build fails only on CI, but not being able to find out why), and it could reduce demands on Alexey's time (like getting a stable release branch CI build running).

Initially, this could even be a parallel environment that only emails Rails core or a smaller audience until we see if it works. I can set this up (I didn't say maintain it forever) if pointed to a box and given Alexey's current ccrb config files - ccrb is pretty simple, and if it is on a dedicated box, we can invite other interested parties to help support it without major security concerns.

-- Chad