If an “allowed to fail” job is canceled, the whole build is marked as canceled

Before pasting any links here, I would like to confirm whether what I’m experiencing is correct or not. It seems like a bug to me, since I couldn’t find much in the documentation about job-cancel behaviour.

My expectation is that if I allow a specific job to fail, it should not affect the outcome of the build, whether the job fails or is canceled.
However, what I experience is that if an “allowed to fail” job is canceled, the whole build is marked as canceled, despite all the other jobs passing. I would expect the build to be marked as passed, since that job was allowed to fail. If the job is allowed to fail, I don’t think it really matters whether someone canceled it.
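For context, this is roughly how the existing allow_failures mechanism is declared (job names here are just placeholders, not our actual config):

```yaml
jobs:
  include:
    - name: required-tests
      script: npm test
    - name: optional-lint        # this job may fail...
      script: npm run lint
  allow_failures:
    - name: optional-lint        # ...without failing the build
```

With this, a *failure* of optional-lint leaves the build green – but, as described above, a *cancel* of it does not.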

Is this how it is supposed to work?
Do we have something like allow_cancels for the Travis config?

Kind regards,
Andras Popovics

This is by design. Whether a job is allowed to fail or not, it didn’t run to completion, so you don’t know whether it would have passed or failed if you had let it.
Even though it wouldn’t affect the build’s pass/fail status, the build gives incomplete information about what works in your codebase and what doesn’t.

You can restart that job and let it run to completion – then the build’s status will change to passed/failed as normal.

Hi,

Thanks for the reply. The problem I’m trying to solve is something many people want, judging by this forum and other Google results.

I’m trying to get Travis CI to work with a monorepo, where we don’t necessarily need to run every job, since not everything changed. Unfortunately, as far as I know, I can’t create jobs dynamically with Travis, so I thought my only option was to cancel those jobs which are not needed and are allowed to fail. (Obviously, in a final job at the end of the build, I check that everything that was supposed to pass really did pass.)

You wrote:

Even though it wouldn’t affect the build’s pass/fail status, it gives incomplete information about what works in your codebase and what doesn’t.

But according to what I see, if I cancel any job (even one which is allowed to fail), it does affect the build’s pass/fail status, since the build does not pass – it is marked as canceled.

You can restart that job and let it run to completion – then the build’s status will change to passed/failed as normal.

I don’t understand how that would solve my problem. :thinking:

Thanks,
Andras

  1. The best idea I have for a monorepo is described as “workaround” in How to skip jobs based on the files changed in a subdirectory? - #15 by native-api

    This looks like the most natural way since in monorepos, what needs to be done is decided by intelligent build tools which can do it with much finer granularity than you can achieve by manually splitting your workload into jobs.

  2. Alternatively, you can check at job start whether it needs to run, and quit it early if it doesn’t (with travis_terminate, or by making the rest of your logic a no-op if you need further automatic steps, like saving the cache, to run). Currently, you can only do this as early as before_install: (but still, it’s better than nothing). Run a chunk of user code before the stock installation logic is an FR that would let you do it even earlier, with minimal wasted time.
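As a rough sketch of that alternative (the path packages/my-pkg/ is hypothetical), the early-exit check could look like this in .travis.yml:

```yaml
before_install:
  - |
    # Skip this job early if nothing under packages/my-pkg/ changed
    # in the commit range being built ($TRAVIS_COMMIT_RANGE is set by Travis).
    if git diff --quiet "$TRAVIS_COMMIT_RANGE" -- packages/my-pkg/; then
      echo "No relevant changes for this job; terminating early."
      travis_terminate 0
    fi
```

travis_terminate 0 ends the job immediately with a passing status, so nothing after before_install: runs.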

I see, thanks.

We are using Nrwl’s Nx to determine which part of the build needs to run. We already tried to play with the alternative solution you mentioned:

  • every job checks whether it needs to run or not. If not, it “short circuits” itself with a 0 exit code; otherwise it lets the job proceed

However, since we are using Nx, which is an npm package, we need node_modules restored from cache. Since we are talking about a monorepo with a single-version dependency policy, node_modules can be heavy. Even combining the Travis cache with our own S3-based cache, in the best-case scenario the job needs 2.5 minutes on average before it reaches the point where it can decide whether to stop or proceed.
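A sketch of such a short-circuit step (the project name my-app and the base branch are placeholders, and the exact Nx command may differ between Nx versions):

```yaml
script:
  - |
    # Needs node_modules restored first - that is where the ~2.5 minutes go.
    # Ask Nx which projects are affected by this change set.
    AFFECTED=$(npx nx print-affected --select=projects --base=origin/master)
    if ! echo "$AFFECTED" | grep -qw "my-app"; then
      echo "my-app is not affected by this change; short-circuiting with exit 0."
      exit 0
    fi
    npx nx test my-app
```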

In itself, 2.5 minutes doesn’t sound like much. But we are talking about a monorepo with as much granularity as possible, and with a monorepo we have many PRs and builds daily. At the moment we have around 30 (mostly parallel) jobs per build with this decision-making logic, and we expect to have 50 or more soon. So if you multiply these numbers together, you can see we are already wasting a lot of time because we can’t make this decision earlier.

I’m afraid the “natural way” you mentioned wouldn’t work for us either, since our goal is to make as many things parallel as possible (which, as you mentioned, can be hard to achieve, especially with communication between jobs that tries to distribute the work), and to run only those things which are absolutely needed.

My preferred solution would have been:

  1. I created a so-called “Job Manager” job as the first job of the build, which decided which jobs were needed and which were not. Using the Travis API V3, I canceled every unnecessary job for that particular build.
  2. At the very end of the build I also had a “Build evaluator” job, which checked that every job needed for this build had passed.
  3. All the other jobs in between these two were allowed to fail, so they didn’t stop the build, even if some of them were canceled.
    This was when I realised that despite every job passing (except for the canceled ones, obviously), the build was marked as canceled at the end.
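A minimal sketch of the cancel call from such a “Job Manager” job (the token variable name and the way $UNNEEDED_JOB_ID is obtained are assumptions; the endpoint is Travis API V3’s job cancel on travis-ci.com):

```yaml
script:
  - |
    # Cancel a job that this build does not need, via the Travis API V3.
    # $TRAVIS_API_TOKEN is a secure env var; $UNNEEDED_JOB_ID comes from
    # the job manager's own decision logic (not shown here).
    curl -s -X POST \
      -H "Travis-API-Version: 3" \
      -H "Authorization: token $TRAVIS_API_TOKEN" \
      "https://api.travis-ci.com/job/${UNNEEDED_JOB_ID}/cancel"
```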

I feel that I was so close to a monorepo-friendly CI solution with the current design of Travis, but because this feature is missing, I won’t be able to achieve it.

Do you think it would make sense to enrich the current design of Travis CI with such a feature? To somehow let “allowed to fail” jobs be canceled without marking the whole build as canceled as a penalty. Or to have an “allow_cancel” configuration (similar to “allow_failures”) for the job matrix?

I assume this could make Travis more usable for monorepos. Obviously, the best option would be the ability to create jobs dynamically, but I think that is a much bigger chunk of work.

But if you decide whether a job needs to run in a “job manager” job, you don’t really need to pull Nx into each job for it to be able to cancel itself. You only need to send that job a signal – e.g. by placing a flag into the workspace contents.
Without Run a chunk of user code before the stock installation logic, this probably means that you cannot use much of Travis’ stock logic (restoring the package cache, installing custom software) and will have to do the same things manually.
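A sketch of that flag-passing idea using build stages and Travis workspaces (stage names, job names, and the decide.sh script are made up; how early the workspace contents become available is subject to the caveat above):

```yaml
jobs:
  include:
    - stage: decide
      name: job-manager
      # decide.sh (hypothetical) prints one needed job name per line.
      script: ./scripts/decide.sh > jobs-to-run.txt
      workspaces:
        create:
          name: decisions
          paths: jobs-to-run.txt
    - stage: build
      name: build-my-app
      workspaces:
        use: decisions
      before_install:
        # Exit early (and green) unless the manager listed this job.
        - grep -qx "build-my-app" jobs-to-run.txt || travis_terminate 0
      script: npm run build
```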


The other “natural way” I suggested in How to skip jobs based on the files changed in a subdirectory? - #15 by native-api is to use “beefier build machines” that would be able to run lots of tasks in parallel. Note that whenever you split a build into jobs by hand, you’re paying the tax of an artificial barrier (not unlike nested makefiles) and having to set up the environment in every job.

Unless Travis can provide you with those on demand (they have never commented either way), you’ll need to purchase Travis Enterprise, which would allow you to build on your own worker machines with however much power you need. If you need to run 50 jobs in parallel on every build, and lots and lots of builds per day, you seem to be a large-scale enough consumer for that.