I see, thanks.
We are using Nrwl’s Nx to determine which part of the build needs to run. We already tried to play with the alternative solution you mentioned:
- every job checks whether it needs to run or not. If not, it “short-circuits” itself with a 0 exit code; otherwise it lets the job proceed
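To make the short-circuit idea concrete, here is a minimal sketch in Python rather than our actual script. The names `parse_affected` and `run_if_affected` are made up for illustration, and `affected_cmd` is a placeholder for whatever command prints the affected project names (the exact Nx invocation depends on the Nx version, so it is left as an assumption).

```python
from __future__ import annotations

import subprocess


def parse_affected(output: str) -> set[str]:
    """Turn a comma- or whitespace-separated project list into a set."""
    return set(output.replace(",", " ").split())


def run_if_affected(project: str, affected_cmd: list[str], job_cmd: list[str]) -> int:
    """Return the exit code the CI job should finish with."""
    # affected_cmd is a placeholder: any command that prints affected project names.
    listing = subprocess.run(affected_cmd, capture_output=True, text=True, check=True)
    if project not in parse_affected(listing.stdout):
        print(f"{project} is not affected, skipping")
        return 0  # short-circuit: the job ends as passed
    # The project is affected, so run the real work and propagate its exit code.
    return subprocess.run(job_cmd).returncode
```

A job would call `run_if_affected` with its own project name and real build command, then exit with the returned code. The affected check itself still needs node_modules to be present, which is where the cost comes from.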
However, since we are using Nx, which is an npm package, we need the node_modules restored from cache. Since we are talking about a monorepo with a single-version dependency policy, node_modules can be heavy. Even when combining the Travis cache with our own S3-based cache, in the best-case scenario the job needs 2.5 minutes on average to reach the point where it can decide whether to stop or proceed.
In itself, 2.5 minutes doesn’t sound like much. But we are talking about a monorepo with as much granularity as possible, and with a monorepo we have a lot of PRs and builds daily. At the moment we have around 30 jobs (mostly parallel) per build that have this decision-making logic, and we expect to have 50 or more soon. So if we multiply these together, we are already wasting a lot of time because we can’t make this decision earlier.
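As a rough back-of-the-envelope calculation, using only the figures above (2.5 minutes per job, 30 jobs now, 50 expected). The function name is made up for illustration, and the result is an upper bound on the waste, since jobs that do proceed need the restored node_modules anyway.

```python
MINUTES_BEFORE_DECISION = 2.5  # best-case average per job, with both caches


def pre_decision_minutes(jobs_per_build: int,
                         minutes_per_job: float = MINUTES_BEFORE_DECISION) -> float:
    """Job-minutes spent per build before any job can decide to stop."""
    return jobs_per_build * minutes_per_job


print(pre_decision_minutes(30))  # 75.0 job-minutes per build today
print(pre_decision_minutes(50))  # 125.0 job-minutes per build at 50 jobs
```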
I’m afraid the natural approach you mentioned wouldn’t work for us either, since our goal is to run as many things in parallel as possible (which, as you mentioned, can be hard to achieve, especially when jobs have to communicate with each other to distribute the work), and to run only those things that are absolutely needed.
My preferred solution would have been:
- I created a so-called “Job Manager” job as the first job of the build, which decided which jobs were needed and which were not. Using the Travis API V3, I canceled every unnecessary job for that particular build.
- At the very end of the build I also had a “Build evaluator” job, which checked whether every job needed for this build had passed.
- All the other jobs in between these two were allowed to fail, so they didn’t stop the build even if some of them were canceled.
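The two helper jobs can be sketched roughly like this. It is a simplified illustration, not the real scripts: the mapping from job id to project (`job_projects`), the function names, and the `API_BASE` host are assumptions. The cancel call itself is the API V3 endpoint `POST /job/{job.id}/cancel`.

```python
from __future__ import annotations

import urllib.request

API_BASE = "https://api.travis-ci.com"  # assumption: may be a different Travis host


def jobs_to_cancel(job_projects: dict[int, str], affected: set[str]) -> list[int]:
    """Job Manager: ids of jobs whose project is not affected by the change."""
    return sorted(job_id for job_id, project in job_projects.items()
                  if project not in affected)


def cancel_request(job_id: int, token: str) -> urllib.request.Request:
    """Build the API V3 request that cancels one job."""
    return urllib.request.Request(
        f"{API_BASE}/job/{job_id}/cancel",
        data=b"",
        method="POST",
        headers={"Travis-API-Version": "3", "Authorization": f"token {token}"},
    )


def build_passed(required: set[int], states: dict[int, str]) -> bool:
    """Build evaluator: every job that was needed must have passed."""
    return all(states.get(job_id) == "passed" for job_id in required)
```

The Job Manager would send each request with `urllib.request.urlopen(cancel_request(job_id, token))`, and the Build evaluator would exit non-zero when `build_passed` returns `False`.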
This was when I realised that, even though every job passed (except for the canceled ones, obviously), the build was marked as canceled at the end.
I feel that I was so close to a monorepo-friendly CI solution with the current design of Travis, but because this feature is missing, I won’t be able to achieve it.
Do you think it would make sense to extend the current design of Travis CI with such a feature? For example, jobs that are allowed to fail could be canceled without the build itself being penalized with a canceled or failed status. Or there could be an “allow_cancel” configuration (similar to “allow_failures”) for the job matrix.
I assume this could make Travis more usable for monorepos. Obviously the best option would be the ability to create jobs dynamically, but I think that is a much bigger chunk of work.