Multiarch builds (lxd) always time out on failure instead of exiting

Our multiarch builds time out, whereas they should fail immediately on any error in the unit tests.

The issue is correlated with the virtualization solution: x86 uses gce vms and the others use lxd.

As travis has planned to switch x86 to lxd as well, it would be nice to fix this regression before that happens.

Here is a test matrix, where builds should either fail or pass within one minute. However, it can be seen that failing lxd builds last 10 minutes (the default timeout).


This is still an issue: https://travis-ci.org/github/MarcoFalke/debug_travis/builds/673940545

I’ve been stuck on this problem for months: basically right after the arm/s390x/ppc64le support was announced, I made https://github.com/gap-system/gap/pull/3744 and for all three architectures, the tests end in a timeout. I’ve changed our executable to abort and inserted a bunch of echo FOOBAR NNN statements in the script orchestrating the CI run to narrow it down, and it’s always the same: our exec exits with a non-zero code (or crashes via abort), and then… nothing; the shell script that launched it does not continue to execute, the VM just hangs until the timeout kills it.

Sadly that means we can’t use these; I was particularly looking forward to having s390x available so that we could finally do CI tests on a big endian machine. Ah well :frowning:

Here is a list of other threads that discuss the same or closely related issues:

See also https://github.com/travis-ci/travis-build/pull/1816

I’ve been able to work around the issue by adding || travis_terminate $? to each script line in our .travis.yml
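For anyone who wants to try the same workaround, here is a minimal sketch of what it could look like. The two commands are hypothetical placeholders, not taken from any real project; only the || travis_terminate $? suffix comes from the post above.

```yaml
# .travis.yml (fragment)
# ./configure and "make check" are placeholder commands for illustration.
script:
  - ./configure || travis_terminate $?
  - make check || travis_terminate $?
```

The suffix has to be repeated on every line, because each entry under script: is run as a separate command.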

I’ve reduced the reproducible failure to a single-line script in the travis yaml:

https://travis-ci.org/github/MarcoFalke/debug_travis/builds/674533618

Yeah, any non-zero exit code stumps the whole thing.

set -o errexit; is also required, I believe


@Marco indeed, yes (i.e., set -e)

On s390x, the machine will sit idle for up to 6 hours after a failure; see https://travis-ci.org/github/bitcoin/bitcoin/jobs/678415052

Not only does it block our whole queue, this also seems wasteful of travis resources.


Don’t set errexit in the main shell. This shell also runs the build’s internal logic and the sudden exit terminates it abnormally.
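To illustrate the point, here is a rough shell sketch. The runner function is a hypothetical stand-in that I made up for this post, not Travis's actual build script; it only assumes that the user command is evaluated in the same shell that later reports the result.

```shell
# Hypothetical runner, NOT Travis's real implementation: it evals the
# user's command in its own shell, then tries to report the status.
runner() {
  bash -c '
    eval "$1"
    echo "runner: command finished with status $?"
  ' runner "$1"
}

# Without errexit, the runner survives the failing command and reports it.
runner 'false'

# With errexit set by the user command, the shell exits inside eval,
# so the reporting line above is never reached.
runner 'set -o errexit; false' || echo "outer shell saw exit status $?"
```

In the second call nothing is printed by the runner itself; only the outer shell notices that it went away.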

If a failing command should terminate the build, put it into a section other than script:, e.g. install. The script: section is designed to run tests, i.e. stuff that you want to run to completion even if something fails.
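A minimal sketch of that split, assuming hypothetical script names (./build.sh, ./run_unit_tests.sh and ./run_functional_tests.sh are placeholders):

```yaml
# .travis.yml (fragment); all script names are placeholders.
install:
  # A non-zero exit here errors the build and stops it immediately.
  - ./build.sh
script:
  # These all run even if an earlier one fails; the job is marked
  # as failed at the end if any of them returned non-zero.
  - ./run_unit_tests.sh
  - ./run_functional_tests.sh
```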
