Work out kinks in interactions between stages, allow_failures, and fast_finish

In the Rust library imap, I’m looking to reduce the time it takes to get results from CI. My plan was to use a combination of build stages, allow_failures, and fast_finish. The result looks like this: https://github.com/jonhoo/rust-imap/blob/562b77255c61cd1cf813b8034d1c3d3d1325f582/.travis.yml
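For readers who don't want to open the link, here is a minimal sketch of the kind of configuration described above. It is not copied from the linked file. The stage names come from this thread, but the individual jobs, the `COVERAGE=1` marker, and the scripts are assumptions made up for illustration.

```yaml
language: rust

# Stages run in this order, one at a time.
stages:
  - test
  - integration
  - lint
  - coverage

jobs:
  # Intended effect: mark the build finished once only allowed-to-fail jobs remain.
  fast_finish: true
  # Jobs matching any of these entries may fail without failing the build.
  allow_failures:
    - os: windows
    - env: COVERAGE=1        # hypothetical marker used to match the coverage job
  include:
    - stage: test
      os: linux
      rust: stable
    - stage: test
      os: windows
      rust: stable
    - stage: integration
      script: cargo test --features integration   # hypothetical feature name
    - stage: lint
      script: cargo clippy    # assumes clippy is installed in an earlier step
    - stage: coverage
      env: COVERAGE=1
      script: ./ci/coverage.sh                     # hypothetical script path
```

With a layout like this, the Windows job sits in the first stage and the slow coverage job sits alone in the last one, which is where the two problems described below show up.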

Some noteworthy things:

I tried to use that .travis.yml file, and ended up with this build. Two things seem to go wrong:

  • In the “test” stage, when all jobs but the Windows one had completed, Travis did not move on to the next stage (“integration”). Instead, it was waiting for the Windows job to finish. I cancelled the Windows job, and then the next stage was immediately started. I then re-started the Windows job, and then Travis again waited for it to complete before it continued to the “lint” stage. I believe this is GH issue #9677.
  • The last stage with only one job (“coverage”) takes a while to run (~16m without cache), and is allowed to fail. However, the build was not marked as successful once the last of the jobs in the “lint” stage succeeded. Furthermore, when I cancelled the coverage job, the entire build was marked as cancelled, not successful as I would have expected given that the last remaining job was listed under allow_failures. I’ve filed this as GH issue #10356.

+1 Same here.

Now that #9677 has also been closed, this probably becomes the tracking issue for that problem as well, namely that Travis does not advance to the next build stage when fast_finish: true is set.

+1 I’m getting the same thing:
https://travis-ci.org/mmcc007/flutter_architecture_samples/builds/510839158
Does this also happen with matrices?

(In my case, since I no longer depend on stages, I may be able to switch from stages to a matrix.)
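A matrix-only variant without stages could be sketched roughly as follows. The language, operating systems, and toolchain channels are assumptions chosen for illustration and are not taken from any project in this thread. Whether fast_finish behaves as hoped in this form is exactly the open question above.

```yaml
language: rust

# Without stages, every os/rust combination becomes one job in a single matrix.
os:
  - linux
  - windows
rust:
  - stable
  - beta
  - nightly

jobs:
  # Intended effect: report the build result without waiting for allowed-to-fail jobs.
  fast_finish: true
  allow_failures:
    - rust: nightly
    - os: windows
```

Since all jobs here belong to one implicit stage, there is no stage boundary for a slow allowed-to-fail job to hold up.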

In the “test” stage, when all jobs but the Windows one had completed, Travis did not move on to the next stage (“integration”). Instead, it was waiting for the Windows job to finish. I cancelled the Windows job, and then the next stage was immediately started. I then re-started the Windows job, and then Travis again waited for it to complete before it continued to the “lint” stage.

By design, only one stage is allowed to run at a time. That’s what stages are: they run in sequence, one at a time. Even if the only remaining job in a stage is listed under allow_failures, the next stage will not start until that job has finished. That’s the correct and expected behavior.

Maybe this should be split since there are two or maybe even three issues:

  • Stages wait even if only allow_failures jobs are running
  • Build is not marked successful even if only allow_failures jobs are left
  • Canceling an allow_failures job makes the entire build fail

(I came here because I also have the third issue)