Not enough entropy during builds (likely) causing intermittent failures

For about the last week we have had a problem in our Apache Airflow builds that we traced to what looks like a shortage of entropy.

Our builds started to fail intermittently: sometimes they worked, sometimes they did not. The problems were connected with some of the containers that we start via docker-compose: kerberos, mysqldb and cassandra. In some of the builds the containers for those external systems appeared to be missing (or at least we could not connect to them). This did not happen before; it started only recently.

After some digging and additional diagnostics we found the most likely culprit. It seems that some of the containers (kerberos, mysqldb, cassandra) began starting very slowly (tens of seconds instead of under a second). There is no shortage of memory or CPU, but we found that the slowest part of the start-up was security-related code (generating self-signed certificates). We have every reason to believe (and will check further) that the cause is insufficient entropy on the physical machines. Docker containers use entropy provided by the host machine, and apps like mysql, cassandra and kerberos need entropy to generate security-related artifacts.

It looks like when the machines are busy, the many parallel builds and docker containers on the same machine simply use more entropy than is available, which slows everything down. We are going to work around that, but I think it should be fixed by Travis.
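
One way to check this hypothesis on a build machine is to read the kernel's entropy estimate. The following is only a minimal sketch, assuming a Linux host that exposes /proc/sys/kernel/random/entropy_avail; the function name is made up for illustration, and newer kernels report this value differently, so treat it as a rough indicator only.

```python
from pathlib import Path
from typing import Optional

# Standard Linux location of the kernel's entropy estimate (in bits).
ENTROPY_AVAIL = Path("/proc/sys/kernel/random/entropy_avail")


def read_entropy_avail(path: Path = ENTROPY_AVAIL) -> Optional[int]:
    """Return the kernel's entropy estimate in bits, or None if unavailable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        # File missing (non-Linux host) or unexpected content.
        return None


if __name__ == "__main__":
    bits = read_entropy_avail()
    if bits is None:
        print("entropy_avail not readable on this system")
    else:
        print(f"entropy_avail: {bits} bits")
```

Running this at the start of a build and again right before the slow containers start would show whether the estimate drops while other jobs are running on the same host.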

An example build that shows this is here:

https://travis-ci.org/apache/airflow/jobs/637342462#L3216

There you can see that generating the self-signed certificate for mysql takes about 7 seconds during mysqld startup:

2020-01-15T10:45:54.028059Z 0 [Warning] Gtid table is not ready to be used. Table 'mysql.gtid_executed' cannot be opened.
2020-01-15T10:46:01.181555Z 0 [Warning] CA certificate ca.pem is self signed.
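
The gap between the two mysqld log lines quoted above can be computed from their timestamps. This is a small sketch for illustration only; the helper names are hypothetical and the message text is abbreviated.

```python
from datetime import datetime

# The two mysqld log lines quoted above (message text abbreviated).
LOG_LINES = [
    "2020-01-15T10:45:54.028059Z 0 [Warning] Gtid table is not ready to be used.",
    "2020-01-15T10:46:01.181555Z 0 [Warning] CA certificate ca.pem is self signed.",
]


def parse_timestamp(line: str) -> datetime:
    """Parse the leading timestamp of a mysqld log line."""
    stamp = line.split(" ", 1)[0]
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")


def gap_seconds(first: str, second: str) -> float:
    """Return the time between two log lines, in seconds."""
    return (parse_timestamp(second) - parse_timestamp(first)).total_seconds()


if __name__ == "__main__":
    print(f"{gap_seconds(*LOG_LINES):.3f} s")
```

The same helper could be applied to the kerberos and cassandra logs to see where their start-up time goes.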

In a shared environment with containers, each container most likely shares its source of entropy with everything else. There is only so much you can use.

In my experience, additional sources of entropy, such as HAVEGED, have helped. (For Ubuntu, you can use https://packages.ubuntu.com/xenial/arm64/haveged) You might want to look into adding this to your test environment. I am not aware of other practical recommendations.

I’ve already implemented (and merged) a change to map /dev/urandom from the host to /dev/random of all the docker containers: https://github.com/apache/airflow/pull/7185 . So we are good for now. We do not need high security for those keys/certs, as this is all CI and the certs are only used locally for testing and wiped after the build completes, so we are perfectly ok with a software source of entropy.

But I think Travis should anticipate that if they use bigger physical machines to run many jobs, they have to provide a lot more entropy than usual (especially since this is a server with no mouse, no keyboard and no physical HDs connected; such servers usually have far fewer sources of physical entropy than typical desktop machines).
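
For anyone wanting to try the same workaround outside of docker-compose, here is a rough sketch of how the mapping could be expressed as a `docker run` bind mount. This is an assumption for illustration and not necessarily how the linked PR does it; the function name and the image name are hypothetical, and the code only builds the command line without running it.

```python
from typing import List


def docker_run_args(image: str) -> List[str]:
    """Build a `docker run` command that maps host /dev/urandom over /dev/random."""
    return [
        "docker",
        "run",
        "--rm",
        # Bind-mount the host's non-blocking device over the container's /dev/random.
        "--volume",
        "/dev/urandom:/dev/random",
        image,
    ]


if __name__ == "__main__":
    # "mysql:5.7" is only an example image name.
    print(" ".join(docker_run_args("mysql:5.7")))
```

The resulting list can be passed to `subprocess.run` on a machine where docker is installed.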

@BanzaiMan, @potiuk

I’ve already implemented (and merged) a change to map /dev/urandom from the host to /dev/random of all the docker containers: [AIRFLOW-6575] Entropy source for CI tests is changed to unblocking by potiuk · Pull Request #7185 · apache/airflow · GitHub

The bug is in the application, not Travis.

Applications running on Linux should be using /dev/urandom nowadays. Applications should not be using /dev/random. The Linux kernel-crypto folks have been quite clear about that for the last decade or so.

From the Linux kernel-crypto mailing list at Re: [RFC PATCH v12 3/4] Linux Random Number Generator:

Practically no one uses /dev/random. It’s essentially a deprecated interface; the primary interfaces that have been recommended for well over a decade is /dev/urandom, and now, getrandom(2).

If the application can be configured to use /dev/urandom instead of /dev/random, you should do so. If the application cannot use /dev/urandom, then file a bug report against the application and cite the kernel-crypto folks' recommendations.
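
As an illustration of what using the recommended interfaces looks like at the application level, here is a minimal Python sketch. The function names are hypothetical; `os.urandom` and the `secrets` module are standard library calls that draw from the operating system's CSPRNG rather than reading /dev/random.

```python
import os
import secrets


def random_key(num_bytes: int = 32) -> bytes:
    """Return key material from the OS CSPRNG via os.urandom."""
    return os.urandom(num_bytes)


def random_token(num_bytes: int = 32) -> str:
    """Return a hex token; secrets is built on the same OS source."""
    return secrets.token_hex(num_bytes)


if __name__ == "__main__":
    print(len(random_key()), "bytes of key material")
    print(random_token(16))
```

Code written this way does not stall on a busy host the way a read from a blocking /dev/random can.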