Multiarch builds (lxd) always time out on failure instead of exit

Marco · March 20, 2020, 2:49pm

Our multiarch builds time out, whereas they should fail immediately on any error in the unit tests.

The issue is correlated with the virtualization solution: x86 uses gce vms and the others use lxd.

As travis has planned to switch x86 to lxd as well, it would be nice to fix this regression before that happens.

Here is a test matrix, where builds should either fail or pass within one minute. However, it can be seen that failing lxd builds last 10 minutes (the default timeout).

Marco · April 12, 2020, 1:09am

This is still an issue: https://travis-ci.org/github/MarcoFalke/debug_travis/builds/673940545

fingolfin · April 13, 2020, 12:28pm

I’ve been stuck on this problem for months. Right after the arm/s390x/ppc64le support was announced, I made https://github.com/gap-system/gap/pull/3744, and on all three architectures the tests end in a timeout. I’ve changed our executable to abort and inserted a bunch of echo FOOBAR NNN statements in the script orchestrating the CI run to narrow it down, and it’s always the same: our executable exits with a non-zero code (or crashes via abort), and then… nothing. The shell script that launched it does not continue to execute; the VM just hangs until the timeout kills it.

Sadly that means we can’t use these; I was particularly looking forward to having s390x available so that we could finally do CI tests on a big-endian machine. Ah well.

fingolfin · April 13, 2020, 2:16pm

Here is a list of other threads that discuss the same or closely related issues:

See also https://github.com/travis-ci/travis-build/pull/1816

I’ve been able to work around the issue by adding || travis_terminate $? to each script line in our .travis.yml
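For anyone who wants to see the shape of that workaround outside of Travis, here is a rough local sketch. On Travis, travis_terminate is provided by the build environment; the fallback definition below is only a stand-in that mimics its exit behaviour, and run_tests is a hypothetical placeholder for a real test command.

```shell
# Stand-in for local experimentation only. On Travis the real
# travis_terminate comes from the build environment; this fallback
# just exits with the given code.
if ! command -v travis_terminate >/dev/null 2>&1; then
  travis_terminate() { exit "$1"; }
fi

# Hypothetical failing test command.
run_tests() { return 3; }

# Run the step in a subshell so the demo does not close the current shell.
status=0
( run_tests || travis_terminate $?; echo "not reached" ) || status=$?
echo "build step exited with status $status"
```

In a .travis.yml the same pattern would be appended to each entry under script:, as described above.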

Marco · April 13, 2020, 6:29pm

I’ve reduced the reproducible failure to a single-line script in the travis yaml:

https://travis-ci.org/github/MarcoFalke/debug_travis/builds/674533618

fingolfin · April 13, 2020, 7:04pm

Yeah, any non-zero exit code stumps the whole thing.

Marco · April 13, 2020, 7:59pm

set -o errexit; is also required, I believe

fingolfin · April 13, 2020, 9:12pm

@Marco indeed, yes (i.e., set -e )

Marco · April 23, 2020, 10:55am

On s390x, the machine will run idle up to 6 hours after a failure, see https://travis-ci.org/github/bitcoin/bitcoin/jobs/678415052

Not only does this block our whole queue, it also seems wasteful of Travis resources.

native-api · October 25, 2020, 2:09am

Don’t set errexit in the main shell. This shell also runs the build’s internal logic and the sudden exit terminates it abnormally.

If a failing command should terminate the build, put it into a section other than script: – i.e. install. The script: section is designed to run tests – i.e. stuff that you want to run to completion even if something fails.
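One possible way to keep errexit out of the main shell, sketched here as a local illustration rather than anything Travis-specific: enable it only in a child bash process that runs the steps, and let the calling shell inspect the exit status. The echo and false commands are placeholders for real build steps.

```shell
# errexit is set only inside the child process, so the calling shell
# keeps running after the child stops at the first failing command.
status=0
bash -c 'set -o errexit; echo "step 1 ok"; false; echo "step 3 never runs"' || status=$?
echo "child exited with $status; the calling shell is still running"
```

A separate bash -c process is used instead of a ( ... ) subshell because bash ignores errexit inside a subshell that sits on the left-hand side of ||.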

native-api · October 26, 2020, 5:31am

2 posts were split to a new topic: Build shows as successful even though the sole job failed
