Travis Python archive download is flaky

Hi Travis CI,

My daily CRON build failed this morning because for two of my jobs, downloading Travis’s Python archives failed as follows:

pypy2.7-6.0 is not installed; attempting download
Downloading archive: https://s3.amazonaws.com/travis-python-archives/binaries/ubuntu/16.04/x86_64/pypy2.7-6.0.tar.bz2
$ curl -sSf -o pypy2.7-6.0.tar.bz2 ${archive_url}
curl: (56) GnuTLS recv error (-54): Error in the pull function.
Unable to download pypy2.7-6.0 archive. The archive may not exist. Please consider a different version.

(The same thing happened for my Python 3.8-dev job with the https://s3.amazonaws.com/travis-python-archives/binaries/ubuntu/16.04/x86_64/python-3.8-dev.tar.bz2 download.)

I couldn’t get back to my computer to click the “restart job” button until just now, and that button is hidden in the mobile UI, so my project was sporting the red “build failing” badge all day. When I restarted the 3.8-dev job just now, it succeeded, so this was a transient error. Because restarting replaced the log of the failed build, I won’t restart the pypy2.7 job (leaving my project with the “build failing” badge for the time being) so that you can still see the failure: Travis CI - Test and Deploy with Confidence

This same thing has happened for my project several times in the past. I suggest making the following improvements (in order of impact):

  1. Improve reliability of these downloads, and document how often they’re expected to fail.
  2. Detect this type of failure and automatically retry later at a better time.
  3. Don’t use the “build failing” badge when this is the only cause of failure (e.g. behave as though allow_failures had been set).
  4. Don’t hide the “restart job” button in the mobile UI.
  5. When a job is restarted, don’t have the new build output clobber the previous build output.

Thanks for your consideration!

We can add retries to the curl command to make it a little more resilient.
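
As a rough illustration of what retrying the download could look like at the shell level, here is a minimal sketch. It is not the change made in Travis's build scripts: `retry` is a hypothetical helper, and the attempt count and delay in the commented usage line are arbitrary assumptions. The archive URL is the one from the log above.

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper (not part of Travis CI or curl): re-runs COMMAND until
# it exits 0 or MAX_ATTEMPTS attempts have been made.
retry() {
  _max=$1
  _delay=$2
  shift 2
  _attempt=1
  while :; do
    _status=0
    # "|| _status=$?" captures the failure without tripping "set -e".
    "$@" || _status=$?
    if [ "$_status" -eq 0 ]; then
      return 0
    fi
    if [ "$_attempt" -ge "$_max" ]; then
      echo "giving up after $_attempt attempts (last exit status $_status)" >&2
      return "$_status"
    fi
    echo "attempt $_attempt failed with exit status $_status; retrying in ${_delay}s" >&2
    sleep "$_delay"
    _attempt=$((_attempt + 1))
  done
}

# Usage (performs a real download, so it is left commented out):
# archive_url=https://s3.amazonaws.com/travis-python-archives/binaries/ubuntu/16.04/x86_64/pypy2.7-6.0.tar.bz2
# retry 5 10 curl -sSf -o pypy2.7-6.0.tar.bz2 "$archive_url"
```

A wrapper like this retries on any non-zero exit status, including the `(56)` receive error shown in the log, and returns the last exit status if every attempt fails.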

This just happened again in two of the jobs for today’s CRON build:
https://travis-ci.org/jab/bidict/jobs/490102372
https://travis-ci.org/jab/bidict/jobs/490102374

I’ll follow that PR. Thanks for looking into this.

@BanzaiMan, since this flakiness can cause an erroneous “build failing” badge to be shown for working builds, could you please provide users with a one-click manual override to set a project’s build badge back to “passing”?

P.S. I didn’t want to leave my project’s “build failing” badge up any longer (and at this point the failed build log no longer seemed necessary to leave up to help debug), so I just restarted the https://travis-ci.org/jab/bidict/jobs/490102372 job. This time it got past the flaky downloading step. But this could just happen again at any time.

Happened again today: https://travis-ci.org/jab/bidict/jobs/493292613

Just clicked restart build, to hopefully clear the spurious “build failing” badge. While your underlying fix is still in flight, can you provide any workaround or manual override such as any of the ones suggested above? Thanks for your consideration.

Well, a “manual override” is to use language: generic and download and install Python yourself.
https://github.com/matthew-brett/multibuild does this.

I used to do this but decided to go back to using Travis-provided Python versions once they became better-maintained a while back. The kind of workaround I’m looking for is for Travis to implement a stopgap for “don’t mark my whole project as ‘build failing’ when we can tell a job has failed in the first few seconds, before any of its tests have run” given the known flakiness of the Travis Python download step.

Python archives should now be distributed via CDN and the download should be more reliable.

We’ve moved our archives again, to GCS. Speed should be decent, and connections more reliable than S3.

https://github.com/travis-ci/travis-build/pull/1809 adds up to 5 retries for downloading the archives.

Thanks, @BanzaiMan!

I just hit this again with Travis’s Python 3.8 environment, in https://travis-ci.org/jab/bidict/jobs/660752747. I’ve since clicked restart, so the original job log is no longer available, but here is a copy/paste from it:

$ curl -sSf --retry 5 -o python-3.8.tar.bz2 ${archive_url}
curl: (56) GnuTLS recv error (-54): Error in the pull function.
travis_time:end:3b93fcc0:start=1583865134844655944,finish=1583865135204371556,duration=359715612,event=configure
Unable to download 3.8 archive. The archive may not exist. Please consider a different version.

Of course, improving the reliability (the root cause) would be great, but in the meantime, could you please consider working on any of the other mitigations I mentioned in my post above?

  • Detect this type of failure and automatically retry later at a better time.
  • Don’t use the “build failing” badge when this is the only cause of failure (e.g. behave as though allow_failures had been set). And/or provide a button that allows converting a non-allow_failures job into an allow_failures job after it has failed (i.e. an “ignore this failure” button), so that the overall build status of a project can remain passing when there is a spurious failure due to network flakiness.
  • Don’t hide the “restart job” button in the mobile UI.
  • When a job is restarted, don’t have the new build output clobber the previous build output.
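
To make the "detect this type of failure" idea concrete, here is a minimal sketch that classifies curl's exit status. The function names are hypothetical, and which codes count as "transient" is my own assumption based on the exit codes documented in the curl man page. This is not how Travis's build scripts behave today: the logs above print "The archive may not exist" even for the `(56)` receive error.

```shell
# is_transient_curl_exit CODE
# Returns 0 when CODE is a curl exit status that usually points at a network
# problem rather than a missing file. The selection below is an assumption:
#   6  could not resolve host      7  failed to connect
#   28 operation timed out         35 SSL/TLS connect error
#   52 empty reply from server     55 failed sending network data
#   56 failure receiving network data (the error in the logs above)
is_transient_curl_exit() {
  case "$1" in
    6|7|28|35|52|55|56) return 0 ;;
    *) return 1 ;;
  esac
}

# describe_download_failure CODE
# Prints a message that separates "worth retrying" from "probably not there".
describe_download_failure() {
  if is_transient_curl_exit "$1"; then
    echo "transient network error (curl exit $1); worth retrying"
  elif [ "$1" -eq 22 ]; then
    # With -f, curl exits 22 when the server answers with HTTP 400 or above.
    echo "HTTP error (curl exit 22); the archive may not exist"
  else
    echo "download failed (curl exit $1)"
  fi
}
```

A build system could use a check like this to decide whether a failed setup step should count against the build status or be retried later instead.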

Thank you for your consideration.