ARM builds started getting stuck on 18 February

https://travis-ci.org/stephengold/Libbulletjme/jobs/653256495

I had good success with ARM builds from about 10 February to about 10:00 UTC on 18 February. After that, they all got stuck during the Git checkout stage. I don’t believe it’s due to a change in my software, since commits that passed before 18 February fail when re-run now. My amd64 and macOS builds aren’t seeing this issue.

https://travis-ci.org/stephengold/Libbulletjme/jobs/653905722

3 days later, ARM builds are still failing consistently during Git checkout.

I’d appreciate feedback from anyone using arch: arm64, whether or not they’re seeing this issue.

I tried “dist: bionic” for my ARM build, and that job got stuck while installing APT packages:

https://travis-ci.org/stephengold/Libbulletjme/jobs/655008169

So xenial is working better than bionic for me.

Yeah, you do have a lot of clone failures. I experience them on occasion, but it is like once in 300 or 500 jobs. Restarting the job usually clears the failure for me.

Maybe try several things. First, add the following to .travis.yml:

language: cpp
compiler: gcc
git:
    depth: 5
    submodules: false

Second, sync your account with GitHub. Go to travis-ci.org | settings, and select Sync account. Be sure you are at travis-ci.org, and not travis-ci.com.

While on travis-ci.org, go to settings and clear the cache. You are not using ccache, but it probably won’t hurt.

Third, try Bionic:

  - name: GCC, Linux, Aarch64
    os: linux
    arch: arm64
    compiler: gcc
    dist: bionic
    ...

Finally, you might try switching from default-jre to the openjdk-8-jdk package. My thinking is that the default JRE may be mucking with network settings in a bad way. I install openjdk-8-jdk on several machines while testing Android builds on Travis and I have never had a problem. YMMV.
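
Roughly like this, via the apt addon (just a sketch; fold it into however you currently install packages):

addons:
  apt:
    packages:
      - openjdk-8-jdk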

You might also investigate reducing the MTU from 1500 to 1492 or lower. MTU is the sort of thing that can cause unexplained network losses, especially after the JRE install (if the JRE changes a setting). Maybe something like:

sudo /sbin/ifconfig eth0 mtu 1492
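
That could go in before_install, assuming the interface is named eth0 on these images (I have not verified that on arm64):

before_install:
  - sudo /sbin/ifconfig eth0 mtu 1492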

I would also be interested in learning whether restarting the job succeeds for you.


The clone of your GitHub repo is lightning fast, so it’s not like Git is choking on big binary objects. I wish my repos cloned that quickly.

$ time git clone https://github.com/stephengold/Libbulletjme.git
Cloning into 'Libbulletjme'...
remote: Enumerating objects: 771, done.
remote: Counting objects: 100% (771/771), done.
remote: Compressing objects: 100% (371/371), done.
remote: Total 7904 (delta 450), reused 631 (delta 336), pack-reused 7133
Receiving objects: 100% (7904/7904), 4.73 MiB | 14.32 MiB/s, done.
Resolving deltas: 100% (5181/5181), done.

real    0m0.986s
user    0m0.533s
sys     0m0.157s

Wow! Thanks for all the suggestions. I will try them and report back.

I never see clone failures in amd64 jobs, but EVERY arm64 job fails—except as noted below. Restarting an arm64 job results in a failure, even if the original job (from before 18 February) succeeded. I’m convinced there must’ve been an environment change around that time.

As noted above, I tried “dist: bionic” on 20 February, and for arm64 it failed while installing APT packages, even earlier than xenial does: https://travis-ci.org/github/stephengold/Libbulletjme/jobs/655008169

I verified (on travis-ci.org) that my account is synced with GitHub, and I re-synced just to be sure. I don’t have any caches set up, so there’s nothing to clear, no button to push.

Setting the Git depth to 5 and submodules to false failed, timing out near the end of the git clone: https://travis-ci.org/github/stephengold/Libbulletjme/jobs/662120584

Specifying openjdk-8-jdk in place of default-jre failed, timing out while installing APT packages: https://travis-ci.org/github/stephengold/Libbulletjme/jobs/662124259

Before reducing the MTU, I removed the amd64 jobs from my matrix, so that arm64 would be the only Travis job run for each push to the repo, and that build … passed! https://travis-ci.org/github/stephengold/Libbulletjme/jobs/662130205

I’m not sure why 3 jobs in parallel is an issue. Perhaps GitHub interprets the burst of traffic as a DoS attack and stops responding?

Now I’m trying to think of a way to run the 3 jobs sequentially. Suggestions?
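
One idea I may experiment with is build stages, since stages run one after another rather than in parallel (a rough sketch, not yet tested; note that a later stage only runs if the earlier stage passes):

jobs:
  include:
    - stage: arm64
      name: GCC, Linux, Aarch64
      os: linux
      arch: arm64
      compiler: gcc
    - stage: amd64
      name: GCC, Linux, x86-64
      os: linux
      arch: amd64
      compiler: gcc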

I’m re-creating my .travis.yml piece-by-piece to see what will break it.

Now I’m trying to think of a way to run the 3 jobs sequentially. Suggestions?

I’m not sure. I’ve never fiddled with this setting.

Maybe ping @BanzaiMan. He usually bails me out when I hit a wall.

I’m re-creating my .travis.yml piece-by-piece to see what will break it.

Yeah, sounds like a good idea.

By the way, you also have arch: ppc64le and arch: s390x, if interested. s390x is a big-endian architecture. You will definitely want to use Bionic with those two.
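
Following the pattern of your arm64 entry, something like this (a sketch; the names are arbitrary):

  - name: GCC, Linux, ppc64le
    os: linux
    arch: ppc64le
    compiler: gcc
    dist: bionic
  - name: GCC, Linux, s390x
    os: linux
    arch: s390x
    compiler: gcc
    dist: bionic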

Okay.

Specifying openjdk-8-jdk in place of default-jre failed, timing out while installing APT packages

This is weird. I’ve never had a problem updating or installing with apt-get. The docs do say to run update like this:

before_install:
  - sudo apt-get update

Or:

addons:
  apt:
    update: true

Maybe you can update before installing (that’s something I do).
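
For example, combining the update with the JDK package (a sketch, assuming you switch to the apt addon):

addons:
  apt:
    update: true
    packages:
      - openjdk-8-jdk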


3 jobs in parallel didn’t break the build. So my hypothesis was wrong.

Adding a “deploy:” section causes it to break. I see a bisection search in my future :wink:

Looks like the culprit is the “secure:” line in the deployment section. The OAuth token is very long, which might be part of the cause. I’ll investigate other ways to authenticate for uploading to GitHub Releases.
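
For context, the deploy section looks roughly like this (placeholder values; the secure: value is the single, very long encrypted line):

deploy:
  provider: releases
  api_key:
    secure: "...very long encrypted OAuth token..."
  file_glob: true
  file: dist/*        # placeholder path
  skip_cleanup: true
  on:
    tags: true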

I’m stuck.

By trial and error, I discovered a workaround for the issue:
truncating the OAuth token in the “deploy” section of the .travis.yml file.
Here’s the failing job just before applying the workaround:

https://travis-ci.org/github/stephengold/Libbulletjme/jobs/661310771

Here’s the job after applying the workaround:

https://travis-ci.org/github/stephengold/Libbulletjme/jobs/662274752

And here’s a failure after restoring the complete OAuth token:

https://travis-ci.org/github/stephengold/Libbulletjme/jobs/662518118

It’s surprising that a long line in the “deploy” section of the script
can impact the “git checkout” phase of a job, even though deployment is
disabled (due to no tag). Perhaps it’s a buffer overflow.

While this workaround allows builds to succeed, it’s not satisfactory.
Eventually I’ll need to deploy the built binary files. Deployment
requires the complete OAuth token, so it will fail.

Please work with me to come up with a fix or at least a better workaround.

I’m unwilling to authenticate with my GitHub username and password, which is the only alternative I’ve found.

@BanzaiMan do you have any suggestions?

The Ruby project’s arm32 case on arch: arm64 has also hit a stuck-job issue.
https://travis-ci.org/github/ruby/ruby/jobs/663824853#L2856


Could be related. There are some very long lines of text in the Travis config file for the Ruby project.

@stephengold

There are some very long lines of text in the Travis config file for the Ruby project.

The long lines are possibly okay, because the Ruby project’s .travis.yml is heavily customized and has long lines throughout.

The Ruby project’s arm32 case on arch: arm64 has also hit a stuck-job issue.
https://travis-ci.org/github/ruby/ruby/jobs/663824853#L2856

The above job failed due to the overall 50-minute time limit, which had not happened before.

Now a new job fails due to one command hitting the 10-minute no-output limit: “No output has been received in the last 10m0s”.

The situation has changed, but it seems that arch: arm64 is still unstable.

It appears something changed in ARM64: one of my builds succeeded without the workaround!

https://travis-ci.org/github/stephengold/Libbulletjme/jobs/665090271

The improvement (whatever it was) did not persist:

https://travis-ci.org/github/stephengold/Libbulletjme/jobs/665098106