LXD-based ppc64le workers seem to experience networking issues

My .travis.yml contains the following matrix:

matrix:
  include:
  - os: linux
  - os: linux-ppc64le
  - os: linux
    arch: arm64

It works just fine: see

https://travis-ci.org/andreabolognani/kubevirt-libvirt/builds/613084783

for an example of the build succeeding.

However, if I change the matrix to:

matrix:
  include:
  - os: linux
  - os: linux
    arch: ppc64le
  - os: linux
    arch: arm64

that is, use the recommended syntax, the ppc64le job starts consistently failing with

Step 7/14 : RUN dnf install -y dnf-plugins-core &&   dnf copr enable -y @virtmaint-sig/for-kubevirt &&   dnf install -y     libvirt-daemon-kvm-${LIBVIRT_VERSION}     libvirt-client-${LIBVIRT_VERSION}     qemu-kvm-${QEMU_VERSION}     genisoimage     selinux-policy selinux-policy-targeted     nftables     augeas &&   dnf clean all
 ---> Running in 05a0adf730e9
Fedora Modular 30 - ppc64le                     1.7 MB/s | 2.6 MB     00:01
Fedora Modular 30 - ppc64le - Updates            54 kB/s | 3.2 MB     01:01
Fedora 30 - ppc64le - Updates                   0.0  B/s |   0  B     02:00
Error: Failed to download metadata for repo 'updates': Cannot prepare internal mirrorlist: Curl error (28): Timeout was reached for https://mirrors.fedoraproject.org/metalink?repo=updates-released-f30&arch=ppc64le [Connection timed out after 30001 milliseconds]
The command '/bin/sh -c dnf install -y dnf-plugins-core &&   dnf copr enable -y @virtmaint-sig/for-kubevirt &&   dnf install -y     libvirt-daemon-kvm-${LIBVIRT_VERSION}     libvirt-client-${LIBVIRT_VERSION}     qemu-kvm-${QEMU_VERSION}     genisoimage     selinux-policy selinux-policy-targeted     nftables     augeas &&   dnf clean all' returned a non-zero code: 1
The command "bash build.sh" exited with 1.

See

https://travis-ci.org/andreabolognani/kubevirt-libvirt/builds/613095270

for an example of the build failing.

Using the new syntax causes Travis CI to switch from VM-based workers to LXD-based workers: the first job has

hostname: 1def08f5-92ba-4619-802e-3109452874ae@9434.production-1-openstack-worker-2
version: v4.0.0 https://github.com/travis-ci/worker/tree/e5cb567e10c0eefe380e81c9a2229ac8fb6a16ce
instance: a693d606-00ad-4638-b035-74c380e43626 travis-ci-sardonyx-xenial-1566261094-697a111 (via amqp)
startup: 41.020004928s

as worker information, while the second has

hostname: 2c87e955-e666-4296-885c-7eaa9e8b0169@21406.lxd-ppc64le-travis-ci-production-1-worker5
version: ? ?
instance: travis-job-andreabolognan-kubevirt-libvi-613095272 a1995848861d6e425dc26aface0d8677f5e848e14317505583ed33b5b06615cb (via amqp)
startup: 2.501914291s

Notice how the version number is not recognized: the aarch64 job for the same build has

hostname: ceb583ae-bb2c-4e89-ab07-b920272e9bbd@739893.lxd-arm64-03-org
version: v6.2.5-2-g995f198 https://github.com/travis-ci/worker/tree/995f1980fbb1eaff25cdc518587f13c1f7e1cdd0
instance: travis-job-andreabolognan-kubevirt-libvi-613095273 88362a8efbf7f095de503c5310074cc4b8c913f2bca6bbc9799c6116461a353d (via amqp)
startup: 5.594971039s

instead, and it succeeds; so my guess is that the problem is not in the use of LXD per se, but rather in the ppc64le LXD environment specifically.

It would be great if you could investigate this issue. I’ll gladly provide whatever additional information you might need.

I have noticed that the current builds for all architectures (amd64/arm64/ppc64le) are failing on the LXD back end at travis-ci.com because of a missing dependency: see https://travis-ci.com/ghatwala/libvirt/builds/137146687
“Error: Unable to find a match: libvirt-daemon-kvm-5.0.0 libvirt-client-5.0.0”

That’s due to less than ideally timed changes in the COPR repo, and completely unrelated to the issue I reported above.

P.S. Note that I have a bunch of pending PRs opened against the kubevirt/libvirt repository that seem to at least partially overlap with what you’re doing in your fork, so you might want to take a look at them before you spend time on something that I’ve already addressed there :slight_smile:

So the issue on ppc64le seems to be “intermittent network failures”: as seen in https://travis-ci.com/ghatwala/libvirt/jobs/258086747, the build went ahead this time. And yes, I will have a look at your PRs too.

Interesting! I was hitting it consistently the other day.

They’ve been merged in the meantime O:-)

One difference is that my latest ppc64le job ran on lxd-ppc64le-travis-ci-production-1-worker1, whereas yours ran on lxd-ppc64le-travis-ci-production-1-worker5. I am hoping the intermittent network failures are temporary and will be resolved soon. Great to see your PRs merged too :slight_smile:
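
For anyone else hitting the same intermittent timeouts while this gets investigated, one possible stopgap is to retry the failing step instead of letting a single network hiccup fail the job. The following is only a minimal sketch, assuming the build is driven by `bash build.sh` as in the log above. The `retry` helper, the attempt count, and the delay are all hypothetical and not part of the original build script.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not from the original build script):
# run a command up to $1 times, sleeping $2 seconds between attempts.
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "Command failed after $n attempts: $*" >&2
      return 1
    fi
    echo "Attempt $n failed, retrying in ${delay}s..." >&2
    n=$((n + 1))
    sleep "$delay"
  done
}

# Example (assumed values): retry the whole build up to 3 times, 30 seconds apart.
# retry 3 30 bash build.sh
```

Commands placed directly in .travis.yml can also be prefixed with Travis CI's built-in `travis_retry` function. Either way this only papers over the problem: if the worker's network is consistently unreachable, as it was in the failing build above, retrying will not help.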