Job passes under container-based infrastructure, fails under virtual machine infrastructure


#1

I have a job that passes under the container-based infrastructure (with sudo: false) and fails under the virtual machine infrastructure (without sudo: false).

The job takes 35-40 minutes to run in the container-based infrastructure.
To prevent Travis from terminating the job, I either use travis_wait or I supply a flag so that it produces output continuously and Travis does not time it out.
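For context, the two approaches look roughly like this in .travis.yml (a hypothetical sketch; "make compile" and the flag name are placeholders, not the actual targets in my repository):

```yaml
# Variant A: travis_wait extends the no-output timeout (here to 45 minutes),
# but it buffers the command's output until the command finishes.
script:
  - travis_wait 45 make compile        # "make compile" is a placeholder

# Variant B: the command itself prints progress continuously,
# so Travis never hits the no-output timeout.
# script:
#   - make compile VERBOSE=1           # placeholder flag producing steady output
```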

I have four variants of the job, in different branches:

  • using the VM infrastructure and travis_wait (master branch) – TIMES OUT
  • using the VM infrastructure and output (branch no-travis-wait) – TIMES OUT
  • using the container-based infrastructure and travis_wait (branch use-container) – SUCCEEDS
  • using the container-based infrastructure and output (branch no-travis-wait-use-container) – SUCCEEDS

You can see the 4 jobs on Travis:
https://travis-ci.org/typetests/daikon-typecheck/branches
and on GitHub (in case you want to diff the projects):

The master branch with VM infrastructure and travis_wait terminates after 50 minutes and has no useful information in its log.
I can compare logs for the no-travis-wait branch with VM infrastructure and output. It terminates after 40 minutes, and the most recent two runs halted after exactly the same amount of output. (This is perhaps suspicious.)

Can you suggest what I should do to diagnose the problem?

Thanks for your help!

-Mike


#2

Your script gives virtually no feedback about what it is doing or what might be taking a long time to execute.

travis_wait works fine if the command you give it finishes in time, but otherwise it simply hides the output. I would first remove travis_wait entirely and write to STDOUT from a background process (as simple as function dots () { while true; do echo "."; sleep 60; done; }; dots &), make your script more verbose (say, with set -x or set -v), and add timestamps (date may be sufficient) to see what exactly is slow. (If it turns out to be one of your make targets, then I don't know what else to suggest, besides profiling what your Java process(es) might be doing.)
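Putting those pieces together, the script might look like this (a sketch; "make compile" stands in for whatever your long-running step actually is):

```shell
#!/bin/bash
set -x                              # print each command as it runs

# Keep-alive: print a dot every 60 seconds so Travis sees output
# even while a single long command is running.
dots () { while true; do echo "."; sleep 60; done; }
dots &
DOTS_PID=$!

date                                # timestamp before the slow step
make compile                        # placeholder for your real target
date                                # timestamp after it

kill "$DOTS_PID"                    # stop the keep-alive when done
```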

We can also turn on the debug feature for this repo, so you can take a look more interactively. Please email support@travis-ci.com if you are so inclined.


#3

Thanks for your reply; I appreciate it.

My message gave links to 4 different branches: 2 of the branches use travis_wait, and 2 of the branches print progress output.
Your response linked to just one branch that used travis_wait and doesn’t give much feedback.
Did you look at the other 3 branches?

For concreteness, here are links to logs to the corresponding job in builds of all 4 branches:
https://travis-ci.org/typetests/daikon-typecheck/jobs/462554102
https://travis-ci.org/typetests/daikon-typecheck/jobs/462554056
https://travis-ci.org/typetests/daikon-typecheck/jobs/462550002
https://travis-ci.org/typetests/daikon-typecheck/jobs/462542098

Each of the jobs does exactly the same computation, of which the long-running step is a single call to javac (the Java compiler) that takes 25-26 minutes to run.

The computation is completely identical, and failure does not depend on the use of travis_wait. The only difference between the passing and failing jobs is VM vs. container infrastructure, and the VM infrastructure fails.

Let’s focus just on the versions with progress output, whose branch names contain “no-travis-wait”.

In branch “no-travis-wait” (which uses the VM infrastructure) javac processes 2253 files in 25 minutes, then Travis terminates the job due to no output, after a total of 40 minutes.
Here are the first and last timestamps:
Note: Sun Dec 02 21:01:15 UTC 2018
Note: Sun Dec 02 21:25:56 UTC 2018
On multiple runs, the job always fails after 2253 files.

In branch “no-travis-wait-use-container”, javac processes all 3493 files. The first 2253 take 18 minutes, the next one (which is where the timeout seems to happen on the VM infrastructure) takes 16 seconds, and the remaining ones take 8 more minutes, for a total of 26 minutes.
Here are the relevant timestamps:
Note: Sun Dec 02 22:30:39 UTC 2018
Note: Sun Dec 02 22:48:59 UTC 2018
Note: Sun Dec 02 22:48:22 UTC 2018
Note: Sun Dec 02 22:56:29 UTC 2018

The only significant difference I noticed in the logs is that the container version contains many occurrences of
Picked up _JAVA_OPTIONS: -Xmx2048m -Xms512m
as in this output:

  $ java -Xmx32m -version
  Picked up _JAVA_OPTIONS: -Xmx2048m -Xms512m
  java version "1.8.0_151"
  Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
  Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)
  $ javac -J-Xmx32m -version
  Picked up _JAVA_OPTIONS: -Xmx2048m -Xms512m
  javac 1.8.0_151

The VM logs do not contain the “Picked up _JAVA_OPTIONS” lines, so perhaps this is a difference in the configuration between the two versions.
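If the memory settings are indeed the relevant difference, one experiment would be to set _JAVA_OPTIONS explicitly so the two environments match. (The -Xmx/-Xms values below just mirror what the container logs show; I have not confirmed that this is the fix.)

```shell
# Reproduce the container default when running on the VM infrastructure:
export _JAVA_OPTIONS="-Xmx2048m -Xms512m"

# ...or, going the other way, clear it on the container infrastructure:
# unset _JAVA_OPTIONS

# Any JVM started afterwards reports the setting on stderr,
# e.g. "Picked up _JAVA_OPTIONS: -Xmx2048m -Xms512m".
command -v java >/dev/null && java -version || true
```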

Does this additional information help you?

Some behavior difference in some program is being exposed by a difference between the VM and container-based configurations. Can you provide any suggestions based on the experience of other users who ran into similar problems?


#4

A ping about this message.