Segfaults in arm64 environment

Hi Everyone,

We successfully cut in arm64 testing with GCC and Clang late last week. The Linux testing uses Xenial images. All CI testing passed.

This week we noticed unexplained segfaults on arm64. An example is here. A typical message is shown below.

Testing SymmetricCipher algorithm RabbitWithIV.
............................................................
............................................................
.........../home/travis/.travis/functions: line 134:  2018 Segmentation fault

The segfault moves around. Sometimes one set of tests fails, and at other times another set fails.

We reverted a few commits to go back to the last known good, but the arm64 segfaults persist. Our last known good is here.

We can’t duplicate the arm64 segfaults at the compile farm (GCC117 and GCC118), and we can’t duplicate them on four aarch64 dev boards. We also cannot duplicate them on other architectures and environments, like x86, x86_64, arm, ppc64le or ppc64be.

I’m beginning to suspect the script /home/travis/.travis/functions or something similar.

From the build information at the head of the output, I’ve gotten this far:

  • last known good: travis-build version: 2f1f818b6
  • segfaults: travis-build version: a91ac50bd

One of the side effects of the build version change:

  • last known good: gcc (Ubuntu/Linaro 7.4.0-1ubuntu1~18.04.1) 7.4.0
  • segfaults: gcc (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.11) 5.4.0

My first question is, what changed on Sunday or Monday in the arm64 environment?

My second question is, how do we work around the changes?

Thanks in advance.

Hi @noloader

Thanks for the detailed feedback, and happy to see you using Arm builds!

One request: are you able to verify whether the segfaults occur with dist: bionic?

Before Monday, all dist references ended up on the Ubuntu Bionic OS image anyway - the OS reported in the build job log’s Build Environment section is taken from .travis.yml rather than from the actual image.

On Monday we added an actual Xenial OS image in order to allow builds to run within an LXD container on the proper target OS. So your dist: xenial builds started to actually run on Xenial, as an LXD container on an LXD host (the LXD host is Bionic at the moment).
(All of the above in the context of arm64 builds, of course.)

So - checking whether the segfaults occur on dist: bionic, along with the info you have provided so far, could help narrow down the cause.
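For reference, a minimal sketch of that check in .travis.yml (the language and compiler entries are just placeholders for your existing configuration):

os: linux
arch: arm64
dist: bionic      # instead of dist: xenial, to check whether the segfaults still occur
language: cpp     # placeholder - keep whatever you use today
compiler: gcc     # placeholder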

Best Regards
Michał

Hi @noloader!

Is this segmentation fault on arm64 still occurring for you? I’ve seen you’ve also been adopting the IBM Power and Z targets. The reason I’m asking is that, at the moment, both our cache (meant to keep items between builds) and beta workspaces (artifacts between jobs in a build) should work correctly across different architectures. If you have a binary for which the segmentation fault is reproducible, it would help us debug it. Maybe such a binary could be deployed using the recent DPL2 to a place from where we could download it?
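For example, a deploy stanza roughly along these lines could push such a binary somewhere we can fetch it (just a sketch - the provider, token and file path are placeholders, and edge: true opts the job into dpl v2):

deploy:
  provider: releases                   # any provider you already use is fine; GitHub Releases is only an example
  edge: true                           # use dpl v2
  token: $GITHUB_TOKEN                 # placeholder - set in the repository settings
  file: path/to/failing-test-binary    # placeholder for the binary that segfaults
  on:
    all_branches: true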

Hi, we’ve also been seeing segfaults on arm64. I can’t reproduce them locally, but we suspect that the issue might be overheating of the hardware or a hardware fault. See https://github.com/bitcoin/bitcoin/issues/17481

Compiling will use 100% CPU, so if the heat is not properly dealt with, it could lead to intermittent hardware issues.

Here is another intermittent segfault in the compiler: https://travis-ci.org/MarcoFalke/bitcoin-core/jobs/615121672#L8342

Edit: And another one: https://travis-ci.org/bitcoin/bitcoin/jobs/615102790#L13961

We think we narrowed it down to Xenial images and/or the compiler. bitcoin-core seems to have the same problem. From Build System Information around line 7:

Build language: minimal
Build group: stable
Build dist: xenial
Build id: 615102787
Job id: 615102790
Runtime kernel version: 5.3.0-22-generic
travis-build version: a09969ae2

You can probably sidestep the problem by switching to Bionic images. Just use dist: bionic in your .travis.yml file.

I suspect a Xenial dist-upgrade will also fix the issue, but I don’t know for sure. I seem to recall the Travis docs ask folks to avoid dist-upgrade, so I’m not even sure you can dist-upgrade a Travis image.

We use the Xenial Travis image, but run everything in a Bionic Docker container. See https://travis-ci.org/MarcoFalke/bitcoin-core/jobs/615121672#L116
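A simplified sketch of that kind of setup (not our exact config - the image and the build command are only illustrative):

os: linux
arch: arm64
dist: xenial
services:
  - docker
script:
  # run the actual build inside an Ubuntu 18.04 (Bionic) container on the Xenial image
  - docker run --rm -v "$TRAVIS_BUILD_DIR":/src -w /src ubuntu:18.04 bash -c "apt-get update && apt-get install -y build-essential && make"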

Because it is only intermittent, I think the issue is with the hardware/overheating as mentioned previously.

Hi, I ran into a similar issue on Travis aarch64 with Ubuntu Bionic. Please see the build report:
https://travis-ci.org/ovsrobot/ovs/jobs/621434216#L1999

I saw some similar issues in the community. Does anyone know the cause, and is there any solution to avoid it?

Hi,

As the gcc segmentation fault occurred unexpectedly, I would like to do some investigation into the possible cause. I checked the report at https://travis-ci.org/ovsrobot/ovs/jobs/621434216#L2005 and found a log line: “See <file:///usr/share/doc/gcc-5/README.Bugs> for instructions.”

Is there a way to fetch the bug report at /usr/share/doc/gcc-5/README.Bugs from the previous build job? Any response is appreciated.

Hi @yzyuestc
Sorry, I am afraid there is no way; the container is destroyed after the job is done.

Since this is a static file, and thus should be the same for every job run in the same environment, you can either cat it (or upload it somewhere) in another build, or get the corresponding package online and extract it from there.
(And, just to be clear, this is not a bug report but part of the gcc package’s documentation - presumably with instructions on how to file bug reports.)
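For example, a throwaway job on the same image could simply dump it (a sketch):

arch: arm64
dist: xenial       # match the image of the failing job
language: minimal
script:
  - cat /usr/share/doc/gcc-5/README.Bugs   # static file, identical for every job on the same image

Alternatively, download the corresponding gcc-5 package for Xenial (the file may actually live in gcc-5-base) and extract it locally with dpkg-deb -x to read the same document.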

@yzyuestc

Did you manage to capture anything more?

@Marco @noloader @yzyuestc - thank you for your reports and effort so far.
This is kind of a vanishing point for us.

  1. It happens occasionally, on specific builds, but clearly often enough to be a stability issue, while it does not happen on different hardware or outside of the LXD container.
  2. It happens with dist: xenial and gcc 7 so far; however, there is at least one confirmed case of the problem occurring on bionic (see OvS).

The hardware/temperature issue - we cannot easily verify it on our end, since we do not own the infrastructure. I’ll ask around, though.

The gcc version - can you check whether the issue recurs with the most up-to-date gcc version, installed at the beginning of the job?
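For example, something along these lines (a sketch - it assumes the ubuntu-toolchain-r-test source is usable on the arm64 images, and the exact gcc version is up to you):

arch: arm64
dist: xenial
addons:
  apt:
    sources:
      - ubuntu-toolchain-r-test
    packages:
      - g++-9
env:
  - CC=gcc-9 CXX=g++-9   # make the build use the newer compiler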

The xenial vs bionic question - the LXD container runs Xenial Ubuntu, however the kernel itself is shared with the host (that is the core concept of this kind of container), and the host kernel is the 5.x one from Bionic. As far as we have seen and know, this combination has been stable on arm64, yet it seems to be a trigger for the initial case. Verifying this requires a binary that is a result (or partial result) of a failed job. The binary - would it be possible for any of you to provide, or test locally, binary artifacts uploaded/deployed from a failed build? Do they fail in your local environments?

@noloader @Marco @yzyuestc

We have updated the setup on our end - which machines the arm jobs are redirected to - in order to indirectly verify one suspicious instance. The change is effective as of 2020 Jan 9th, 13:00 UTC. Could you please let us know whether you still observe random segmentation faults after that time?


Hi Michal, @Michal

Thanks for your effort and the update. We will try some new builds to observe whether the segfault issue occurs or not.

Thank you again!

Hi Michal,

We have not experienced the issue since switching to Bionic images.

Jeff

Hi Jeff
Appreciated, thank you!

Hi @yzyuestc
We will wait for your feedback before continuing with potential changes. We’ve been in touch with the infrastructure provider to solve it permanently and would like to be sure that everything now works stably.

Hi Michal,

The segmentation fault issue has not reproduced since the time you replied. It seems everything works fine on my side.

Thank you @yzyuestc.
One down. We will cure the patient :wink:

Now on to the other segfaults :wink:
