PPC64LE jobs are stuck from time to time

Over the past few months we have noticed that, while ARMv8 and System Z jobs are picked up from the queue normally, PPC64LE jobs take longer to start and quite often get stuck in an endless “Job received → Queued” loop.

See, for example: Travis CI - Test and Deploy with Confidence

Then, after a long time, the job silently fails with a timeout. This has happened multiple times, most recently yesterday.

Moreover, while some of these jobs can be restarted once the problematic period ends, others always fail, no matter how often you restart them. Like this one: Travis CI - Test and Deploy with Confidence

Every time, such jobs show me “Automatic restarts limited: Please try restarting this job later or contact support@travis-ci.com.”

Could you please address this problem? The last time I contacted support, I got no answer at all, and we are on a paid plan for two parallel jobs!

Notably, despite our many reports of incidents lasting at least a day, and other reports on this forum, the Travis CI Status page is always green, as if nothing had happened.

And it happened again. There seems to be a major flaw in the infrastructure that should be addressed.

Aaand one more time. It’s getting really ridiculous, given that we pay $129 every month for two concurrent jobs.

It happened once more.

Hey, this should already be fixed. Sorry about the late notice. Thanks.

@mustafa,
It’s happening again.
Along with the “queued / booting” issue, there’s also a new gnutls_handshake error during git clone.
Our 2-concurrent-jobs plan effectively becomes a 0-concurrent-jobs plan when two ppc64le jobs are trying to run…

Link to the gnutls_handshake error: Travis CI - Test and Deploy with Confidence

I can confirm this. I have noticed that it tends to happen on weekends and goes unaddressed until the next working day.
I think they mitigate it by doing something manually every time. If the root cause can’t be resolved, what about automating that mitigation instead?
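
To make the automation idea a bit more concrete, here is a minimal sketch of the detection half: given a list of job records, it picks out the ones that have been sitting in a waiting state for too long. Everything in it is an assumption for illustration. The field names (`id`, `state`, `created_at`), the state names, and the 30-minute threshold are hypothetical, and I don’t know what Travis actually does internally or what the manual fix is.

```python
from datetime import datetime, timedelta, timezone

# Assumed names for "not running yet" states (hypothetical, for illustration).
WAITING_STATES = {"received", "queued"}


def find_stuck_jobs(jobs, now, max_wait=timedelta(minutes=30)):
    """Return the ids of jobs that have been waiting longer than max_wait.

    Each job is a dict with hypothetical keys: "id", "state", "created_at".
    """
    stuck = []
    for job in jobs:
        # Jobs that are already running or finished are not our concern.
        if job["state"] not in WAITING_STATES:
            continue
        # A job is considered stuck once it has waited past the threshold.
        if now - job["created_at"] > max_wait:
            stuck.append(job["id"])
    return stuck


# Example with made-up data.
now = datetime(2021, 1, 1, 12, 0, tzinfo=timezone.utc)
jobs = [
    {"id": 1, "state": "queued", "created_at": now - timedelta(minutes=10)},
    {"id": 2, "state": "received", "created_at": now - timedelta(hours=2)},
    {"id": 3, "state": "started", "created_at": now - timedelta(hours=3)},
]
print(find_stuck_jobs(jobs, now))  # [2]
```

A periodic task could run a check like this and then trigger whatever is currently done by hand (restarting the worker, requeueing the job, or at least alerting someone and updating the status page), so weekends are covered too.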

Just a note that it’s been working fine for the past month.
I don’t know what changed, but it’s working…