ARM builds started getting stuck on 18 February

https://travis-ci.org/stephengold/Libbulletjme/jobs/653256495

I had good success with ARM builds from about 10 February to about 10:00 UTC on 18 February. After that, they all got stuck during the Git checkout stage. I don’t believe it’s due to a change in my software, since commits that passed before 18 February fail when re-run now. My amd64 and macOS builds aren’t seeing this issue.

https://travis-ci.org/stephengold/Libbulletjme/jobs/653905722

3 days later, ARM builds are still failing consistently during Git checkout.

I’d appreciate feedback from anyone using arch: arm64, whether or not they’re seeing this issue.

I tried “dist: bionic” for my ARM build, and that job got stuck while installing APT packages:

https://travis-ci.org/stephengold/Libbulletjme/jobs/655008169

So xenial is working better than bionic for me.

Yeah, you do have a lot of clone failures. I experience them on occasion, but it is like once in 300 or 500 jobs. Restarting the job usually clears the failure for me.

Maybe try several things. First, add the following to .travis.yml:

language: cpp
compiler: gcc
git:
    depth: 5
    submodules: false

Second, sync your account with GitHub. Go to travis-ci.org | settings, and select Sync account. Be sure you are at travis-ci.org, and not travis-ci.com.

While on travis-ci.org, go to settings and clear the cache. You are not using ccache, but it probably won’t hurt.

Third, try Bionic:

  - name: GCC, Linux, Aarch64
    os: linux
    arch: arm64
    compiler: gcc
    dist: bionic
    ...

Finally, you might try switching from default-jre to the openjdk-8-jdk package. My thinking is that the default JRE may be mucking with network settings in a bad way. I install openjdk-8-jdk on several machines while testing Android builds on Travis and I have never had a problem. YMMV.
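
Roughly like this, via the apt addon (just a sketch; fold it into however you currently install packages):

addons:
  apt:
    packages:
      - openjdk-8-jdk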

You might also investigate reducing the MTU from 1500 to 1492 or lower. MTU is the sort of thing that can cause unexplained network losses, especially after the JRE install (if the JRE changes a setting). Maybe something like:

sudo /sbin/ifconfig eth0 mtu 1492
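
That could go in before_install, assuming the interface is named eth0 on these images (I have not verified that on arm64):

before_install:
  - sudo /sbin/ifconfig eth0 mtu 1492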

I would also be interested in learning whether restarting the job succeeds for you.


The clone of your GitHub repo is lightning fast, so it’s not like Git is choking on big binary objects. I wish my repos cloned that quickly.

$ time git clone https://github.com/stephengold/Libbulletjme.git
Cloning into 'Libbulletjme'...
remote: Enumerating objects: 771, done.
remote: Counting objects: 100% (771/771), done.
remote: Compressing objects: 100% (371/371), done.
remote: Total 7904 (delta 450), reused 631 (delta 336), pack-reused 7133
Receiving objects: 100% (7904/7904), 4.73 MiB | 14.32 MiB/s, done.
Resolving deltas: 100% (5181/5181), done.

real    0m0.986s
user    0m0.533s
sys     0m0.157s

Wow! Thanks for all the suggestions. I will try them and report back.

I never see clone failures in amd64 jobs, but EVERY arm64 job fails—except as noted below. Restarting an arm64 job results in a failure, even if the original job (from before 18 February) succeeded. I’m convinced there must’ve been an environment change around that time.

As noted above, I tried “dist: bionic” on 20 February, and for arm64 it failed while installing APT packages, even earlier than xenial does: https://travis-ci.org/github/stephengold/Libbulletjme/jobs/655008169

I verified (on travis-ci.org) that my account is synced with GitHub, and I re-synced just to be sure. I don’t have any caches set up, so there’s nothing to clear, no button to push.

Setting the Git depth to 5 and submodules to false failed, timing out near the end of the git clone: https://travis-ci.org/github/stephengold/Libbulletjme/jobs/662120584

Specifying openjdk-8-jdk in place of default-jre failed, timing out while installing APT packages: https://travis-ci.org/github/stephengold/Libbulletjme/jobs/662124259

Before reducing the MTU, I removed the amd64 jobs from my matrix, so that arm64 would be the only Travis job run for each push to the repo, and that build … passed! https://travis-ci.org/github/stephengold/Libbulletjme/jobs/662130205

I’m not sure why 3 jobs in parallel is an issue. Perhaps GitHub interprets the burst of traffic as a DoS attack and stops responding?

Now I’m trying to think of a way to run the 3 jobs sequentially. Suggestions?
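
One idea I may experiment with is build stages, since stages run one after another rather than in parallel (a rough sketch, not yet tested; note that a later stage only runs if the earlier stage passes):

jobs:
  include:
    - stage: arm64
      name: GCC, Linux, Aarch64
      os: linux
      arch: arm64
      compiler: gcc
    - stage: amd64
      name: GCC, Linux, x86-64
      os: linux
      arch: amd64
      compiler: gcc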

I’m re-creating my .travis.yml piece-by-piece to see what will break it.

Now I’m trying to think of a way to run the 3 jobs sequentially. Suggestions?

I’m not sure. I’ve never fiddled with this setting.

Maybe ping @BanzaiMan. He usually bails me out when I hit a wall.

I’m re-creating my .travis.yml piece-by-piece to see what will break it.

Yeah, sounds like a good idea.

By the way, you also have arch: ppc64le and arch: s390x, if interested. s390x is a big-endian architecture. You will definitely want to use Bionic with those two.
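
Following the pattern of your arm64 entry, something like this (a sketch; the names are arbitrary):

  - name: GCC, Linux, ppc64le
    os: linux
    arch: ppc64le
    compiler: gcc
    dist: bionic
  - name: GCC, Linux, s390x
    os: linux
    arch: s390x
    compiler: gcc
    dist: bionic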

Okay.

Specifying openjdk-8-jdk in place of default-jre failed, timing out while installing APT packages

This is weird. I’ve never had a problem updating or installing with apt-get. The docs do say to run update like this:

before_install:
  - sudo apt-get update

Or:

addons:
  apt:
    update: true

Maybe you can update before installing (that’s something I do).
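
For example, combining the update with the JDK package (a sketch, assuming you switch to the apt addon):

addons:
  apt:
    update: true
    packages:
      - openjdk-8-jdk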


3 jobs in parallel didn’t break the build. So my hypothesis was wrong.

Adding a “deploy:” section causes it to break. I see a bisection search in my future :wink:

Looks like the culprit is the “secure:” line in the deployment section. The OAuth token is very long, which might be part of the cause. I’ll investigate other ways to authenticate for uploading to GitHub Releases.
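
For context, the deploy section looks roughly like this (placeholder values; the secure: value is the single, very long encrypted line):

deploy:
  provider: releases
  api_key:
    secure: "...very long encrypted OAuth token..."
  file_glob: true
  file: dist/*        # placeholder path
  skip_cleanup: true
  on:
    tags: true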

I’m stuck.

By trial and error, I discovered a workaround for the issue:
truncating the OAuth token in the “deploy” section of the .travis.yml file.
Here’s the failing job just before applying the workaround:

https://travis-ci.org/github/stephengold/Libbulletjme/jobs/661310771

Here’s the job after applying the workaround:

https://travis-ci.org/github/stephengold/Libbulletjme/jobs/662274752

And here’s a failure after restoring the complete OAuth token:

https://travis-ci.org/github/stephengold/Libbulletjme/jobs/662518118

It’s surprising that a long line in the “deploy” section of the script
can impact the “git checkout” phase of a job, even though deployment is
disabled (due to no tag). Perhaps it’s a buffer overflow.

While this workaround allows builds to succeed, it’s not satisfactory.
Eventually I’ll need to deploy the built binary files. Deployment
requires the complete OAuth token, so it will fail.

Please work with me to come up with a fix or at least a better workaround.

I’m unwilling to authenticate with my GitHub username and password, which is the only alternative I’ve found.

@BanzaiMan do you have any suggestions?

The Ruby project’s arm32 case on arch: arm64 has also hit a stuck-job issue.
https://travis-ci.org/github/ruby/ruby/jobs/663824853#L2856


Could be related. There are some very long lines of text in the Travis config file for the Ruby project.

@stephengold

There are some very long lines of text in the Travis config file for the Ruby project.

The long lines are possibly okay, because the Ruby project’s .travis.yml is heavily customized and has long lines throughout.

The Ruby project’s arm32 case on arch: arm64 has also hit a stuck-job issue.
https://travis-ci.org/github/ruby/ruby/jobs/663824853#L2856

The above job failed due to the overall 50-minute time limit, which had not happened before.

Now a new job fails due to one command hitting the 10-minute no-output limit: “No output has been received in the last 10m0s”.

The situation has changed, but it seems that arch: arm64 is still unstable.

It appears something changed in ARM64: one of my builds succeeded without the workaround!

https://travis-ci.org/github/stephengold/Libbulletjme/jobs/665090271

The improvement (whatever it was) did not persist:

https://travis-ci.org/github/stephengold/Libbulletjme/jobs/665098106