We successfully cut in arm64 testing with GCC and Clang late last week. The Linux testing uses Xenial images. All CI testing passed.
This week we are noticing unexplained segfaults on arm64. An example is here. A typical message is shown below.
Testing SymmetricCipher algorithm RabbitWithIV.
............................................................
............................................................
.........../home/travis/.travis/functions: line 134: 2018 Segmentation fault
The segfault moves around. Sometimes one set of tests fails, and at other times another set fails.
We reverted a few commits to go back to the last known good, but the arm64 segfaults persist. Our last known good is here.
We can't duplicate the arm64 segfaults at the compile farm (GCC117 and GCC118), and we can't duplicate them on four aarch64 dev-boards. We also cannot duplicate them on other architectures and environments, like x86, x86_64, arm, ppc64le or ppc64be.
I’m beginning to suspect the script /home/travis/.travis/functions or something similar.
From the build information at the head of the output, I’ve gotten this far:
last known good: travis-build version: 2f1f818b6
segfaults: travis-build version: a91ac50bd
One of the side effects of the build version change:
last known good: gcc (Ubuntu/Linaro 7.4.0-1ubuntu1~18.04.1) 7.4.0
Thanks for the detailed feedback, and happy to see you using Arm builds!
One request: are you able to verify whether the segfaults occur with dist: bionic?
Before Monday, all dist references ended up on the Ubuntu Bionic OS image anyway; the OS reported in the Build Environment section of the build job log is taken from .travis.yml rather than from the actual image.
On Monday we added an actual Xenial OS image in order to allow builds to run within an LXD container on the proper target OS. So your dist: xenial builds actually started running on Xenial, run as an LXD container on an LXD host (the LXD host is Bionic at the moment).
(All of the above is in the context of Arm64 builds, of course.)
So, checking whether the segfaults occur on dist: bionic, along with the info you have provided so far, could help narrow down the cause.
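For reference, a minimal sketch of what such a check could look like; the language, compiler, and script entries below are placeholders and would need to match your actual .travis.yml:

    # Hypothetical .travis.yml fragment: run the same arm64 job on both
    # Xenial and Bionic so the results can be compared directly.
    arch: arm64
    language: cpp
    compiler: gcc
    jobs:
      include:
        - dist: xenial   # configuration that currently shows the segfaults
        - dist: bionic   # verification job requested above
    script:
      - make -j 2        # placeholder build and test commands
      - make test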
Is this segmentation fault on arm64 still occurring for you? I've seen you've been adopting IBM Power and Z targets as well. The reason I'm asking is that, at the moment, our cache (meant to keep items between builds) and beta workspaces (artifacts between jobs in a build) should also work correctly across different architectures. If you have a binary for which the segmentation fault is reproducible, it would help us to debug it. Maybe such a binary could be deployed using the recent dpl v2 to a place from where we could download it?
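As a sketch only (the S3 provider, bucket, and directory names below are assumptions, not taken from your project), a dpl v2 deployment of the failing binaries could look roughly like this:

    # Hypothetical .travis.yml fragment: upload build artifacts with dpl v2
    # (enabled via edge: true) so the crashing binary can be downloaded later.
    deploy:
      provider: s3
      edge: true                  # opt in to dpl v2
      access_key_id: $AWS_ACCESS_KEY_ID
      secret_access_key: $AWS_SECRET_ACCESS_KEY
      bucket: my-debug-artifacts  # placeholder bucket name
      local_dir: build            # placeholder directory containing the binaries
      on:
        all_branches: true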
Hi, we’ve also been seeing segfaults on arm64. I can’t reproduce them locally, but we suspect that the issue might be overheating of the hardware or a hardware fault. See https://github.com/bitcoin/bitcoin/issues/17481
Compiling will use 100% CPU, so if the heat is not properly dealt with, it could lead to intermittent hardware issues.
We think we narrowed it down to Xenial images and/or the compiler. bitcoin-core seems to have the same problem. From the Build System Information section, around line 7:
You can probably sidestep the problem by switching to Bionic images. Just use dist: bionic in your .travis.yml file.
I suspect a Xenial dist-upgrade will also fix the issue, but I don’t know for sure. I seem to recall the Travis docs ask folks to avoid dist-upgrade, so I’m not even sure you can dist-upgrade a Travis image.
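For what it's worth, the dist-upgrade attempt would look something like the fragment below; this is only a sketch and, as noted above, it may not be supported or advisable on Travis images:

    # Hypothetical before_install fragment: pull in the latest Xenial packages.
    # Untested on Travis arm64 images and possibly not permitted.
    before_install:
      - sudo apt-get update -qq
      - sudo DEBIAN_FRONTEND=noninteractive apt-get dist-upgrade -y -qq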
As the gcc segfault issue occurred unexpectedly, I would like to do some investigation into the possible cause. I checked the report at https://travis-ci.org/ovsrobot/ovs/jobs/621434216#L2005 and found this log line: "See <file:///usr/share/doc/gcc-5/README.Bugs> for instructions."
Is there a way to fetch the bug report at /usr/share/doc/gcc-5/README.Bugs from the previous build job? Any response is appreciated.
Since this is a static file, and thus should be the same for every job run in the same environment, you can either cat it (or upload it somewhere) in another build, or get the corresponding package online and extract it from there.
(And, just to be clear, this is not a bug report but part of the gcc package's documentation, presumably with instructions on how to file bug reports.)
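To make both options concrete, here is a sketch; whether the file is gzipped, and the exact package that ships it, are assumptions about the Xenial image, so adjust as needed:

    # Option 1: print the file from another build job on the same image,
    # e.g. as an extra line in the script section of .travis.yml.
    cat /usr/share/doc/gcc-5/README.Bugs 2>/dev/null || zcat /usr/share/doc/gcc-5/README.Bugs.gz

    # Option 2: download the gcc-5 package on any Ubuntu 16.04 machine and
    # extract the documentation from it locally.
    apt-get download gcc-5
    dpkg-deb -x gcc-5_*.deb gcc-5-extracted/
    less gcc-5-extracted/usr/share/doc/gcc-5/README.Bugs*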
@Marco @noloader @yzyuestc - thank you for your reports and effort so far.
This is kind of a vanishing point for us:
It happens occasionally and on specific builds, but clearly often enough to be a stability issue, while it is not happening on different hardware or outside of the LXD container.
It happens with dist: xenial and gcc 7 so far; however, there is at least one confirmed case of the problem occurring on bionic (see OvS).
The hardware/temperature issue: we cannot easily verify this on our end, since we do not own the infrastructure. I'll ask around, though.
The gcc version: can you check whether the issue reoccurs with the most up-to-date gcc version installed at the beginning of the job (a sketch follows below)?
Xenial vs bionic: the LXD container runs Xenial Ubuntu, however the kernel itself is shared with the host (that is the core concept of this kind of container), and it is the 5.x kernel from the Bionic host. As far as we have seen and know, this combination has been stable on arm64, yet it seems to be a trigger for the initial case. Verifying this requires a binary that is the result (or partial result) of a failed job. The binary: is it possible for any of you to provide, and test locally, binaries uploaded/deployed from a failed build? Do they fail in your local environments?
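To make the gcc request concrete, a minimal sketch of installing a newer compiler at the start of the job; the toolchain PPA and gcc 9 are assumptions, and the uname line is only there to confirm the 5.x kernel shared from the Bionic host:

    # Hypothetical .travis.yml fragment: install a recent gcc from the
    # ubuntu-toolchain-r/test PPA and record kernel and compiler versions.
    addons:
      apt:
        sources:
          - ubuntu-toolchain-r-test
        packages:
          - gcc-9
          - g++-9
    before_script:
      - uname -r           # should report the 5.x kernel shared from the Bionic host
      - gcc-9 --version
    script:
      - make -j 2 CC=gcc-9 CXX=g++-9   # placeholder build and test commands
      - make test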
We have updated the setup on our end - specifically, which machines the arm jobs are redirected to - in order to indirectly verify one suspicious instance. The change is effective as of 2020 Jan 9th, 13:00 UTC. Could you please let us know whether you still observe random segmentation faults after that time?
Hi @yzyuestc
We will wait for your feedback before continuing with potential changes. We've been in touch with the infrastructure provider to solve this permanently and would like to be sure that everything now works stably.