For about the last week we have had a problem in our Apache Airflow builds, which we traced to what looks like not enough entropy.
Our builds started to fail intermittently: sometimes they worked, sometimes they did not. The problems were connected with some of the containers that we start with docker-compose: kerberos, mysqldb and cassandra. It looked like in some of the builds the docker containers for those external systems were missing (or at least we could not connect to them). This did not happen before - it started recently.
After some digging and additional diagnostics we found the most likely culprit. It seems that some of the containers (kerberos, mysqldb, cassandra) started to start very slowly (tens of seconds instead of less than a second). There is no shortage of memory or CPU, but we found that the slowest part of the start-up was security-related code (generating self-signed certificates). We have every reason to believe (and will check it further) that the cause is a lack of entropy on the physical machines. Docker containers use the entropy provided by the host machine, and apps like mysql, cassandra and kerberos need entropy to generate security-related artifacts.
It looks like when the machines are busy, a lot of parallel builds and docker containers on the same machine simply use more entropy than is available, which slows everything down. We are going to work around that, but I think it should be fixed by Travis.
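For anyone who wants to check this on their own build machine, here is a minimal sketch of reading the kernel's entropy estimate from /proc/sys/kernel/random/entropy_avail on a Linux host. The function names and the threshold of 200 are my own illustrative assumptions, not anything Travis or Airflow provides, and how meaningful the number is depends on the kernel version.

```python
from pathlib import Path
from typing import Optional

# Standard Linux location of the kernel's entropy estimate (in bits).
ENTROPY_PATH = Path("/proc/sys/kernel/random/entropy_avail")


def read_entropy_avail(path: Path = ENTROPY_PATH) -> Optional[int]:
    """Return the entropy estimate, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        # File missing (non-Linux host) or content is not a number.
        return None


def is_entropy_low(value: Optional[int], threshold: int = 200) -> bool:
    """Hypothetical check: the threshold is an assumption, tune it yourself."""
    return value is not None and value < threshold


if __name__ == "__main__":
    value = read_entropy_avail()
    if value is None:
        print("entropy estimate not available on this host")
    else:
        print(f"entropy_avail={value} low={is_entropy_low(value)}")
```

Running this a few times while the containers start should show whether the estimate drops when many builds run in parallel.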
An example build that shows this is here:
There you can see that generating the self-signed certificate for mysql takes 7 seconds during mysqld startup.
2020-01-15T10:45:54.028059Z 0 [Warning] Gtid table is not ready to be used. Table 'mysql.gtid_executed' cannot be opened.
2020-01-15T10:46:01.181555Z 0 [Warning] CA certificate ca.pem is self signed.
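The gap between the two log lines above can be computed directly from their timestamps. This is just a small sketch; the helper name `gap_seconds` is mine, and it assumes the timestamp format used in this mysqld log.

```python
from datetime import datetime

# Timestamp format of the mysqld log lines quoted above (UTC, microseconds).
LOG_TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def gap_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two log timestamps."""
    t0 = datetime.strptime(start, LOG_TS_FORMAT)
    t1 = datetime.strptime(end, LOG_TS_FORMAT)
    return (t1 - t0).total_seconds()


if __name__ == "__main__":
    gap = gap_seconds(
        "2020-01-15T10:45:54.028059Z",  # "Gtid table is not ready" warning
        "2020-01-15T10:46:01.181555Z",  # "CA certificate ... self signed" warning
    )
    print(f"{gap:.3f} seconds between the two warnings")
```

For these two lines the gap comes out at a little over 7 seconds, which matches what we saw.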