Travis job fails tests randomly and terminates builds abruptly

I have a Travis stage which executes a command. This stage runs integration tests, which are written using ScalaTest and WireMock.

   - sbt doSomething

Assume that doSomething runs my integration tests. When this runs on Travis, my tests fail randomly: sometimes with a 598 network timeout, sometimes with a 404 or 500. It is highly irregular. Sometimes the build stops and shows this message:

No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself. Check the details on how to adjust your build configuration on:

Is there a way to set this timeout to a larger value for an sbt project?

But when I start the same stage in Debug Mode and debug that build over SSH, it opens a path in my mac’s terminal. If I then run sbt doSomething there, I don’t see any failures: all tests pass no matter how many times I run the same command.

Ideally, both are doing the same operations. Why is it irregular on Travis, and why does it always pass in quiet mode? Is this a problem with Travis CI? Is there a way to sort it out?

Without a link to a build and an MCVE of what you are doing locally, I can’t say anything specific.

Most probably, a Travis builder is much less powerful than your local machine, which breaks assumptions that your code implicitly makes – things take longer than you expect and/or execute in a different order.

But isn't the debug mode of Travis over SSH the same as running on Travis directly?

Oh, by “opens a path in my mac’s terminal” you meant you are getting an SSH session to the Travis machine. I thought that meant that you were running things locally.

It’s the same VM but not quite the same workload: there isn’t logging, the stock build logic doesn’t run, and standard stream writes would wait on SSH’s socket if their buffer fills (if there are any other subtle changes, like priority boosting to give you a reasonable interactive response time, that’s something only @BanzaiMan or @dominic can answer).

My practice also shows that there can be significant performance fluctuations from job to job – presumably from differing loads on VM hosts.

So your best option here IMO is to make your code give you more information on what is happening so that you have an idea where it’s stuttering if an error occurs. And also sanity-check your test logic for any obvious race conditions (like making sure a server is actually up and running when you are trying to use it).
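To illustrate the "make sure a server is actually up" check, here is a minimal sketch in Scala that uses only JDK sockets. The names Readiness and waitForPort, and the default timeouts, are made up for this example; they are not part of ScalaTest, WireMock or sbt. It polls a TCP port until a connection succeeds or a deadline passes, so a test can fail with a clear "server never came up" message instead of a random 404 or timeout.

```scala
import java.net.{InetSocketAddress, Socket}
import scala.annotation.tailrec
import scala.util.Try

// Hypothetical helper, not part of any library mentioned in this thread.
object Readiness {

  /** Returns true as soon as a TCP connection to host:port succeeds,
    * or false if timeoutMs elapses first. */
  def waitForPort(host: String,
                  port: Int,
                  timeoutMs: Long = 30000L,
                  pollMs: Long = 200L): Boolean = {
    val start = System.nanoTime()

    // One connection attempt; any exception counts as "not ready yet".
    def canConnect: Boolean = Try {
      val socket = new Socket()
      try socket.connect(new InetSocketAddress(host, port), math.max(1L, pollMs).toInt)
      finally socket.close()
    }.isSuccess

    def elapsedMs: Long = (System.nanoTime() - start) / 1000000L

    @tailrec
    def loop(): Boolean =
      if (canConnect) true
      else if (elapsedMs >= timeoutMs) false
      else { Thread.sleep(pollMs); loop() }

    loop()
  }
}
```

In a test's setup you could then write something like assert(Readiness.waitForPort("localhost", port), "stub server did not start in time") before sending the first request, where port is whatever port your stub server listens on. A successful TCP connect only shows that something is listening; it does not prove that stubs are already registered.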

Is there a way to make the actual job analogous to the debug mode? Something like a lighter version?

I cannot answer that, only someone from the staff can.
For me, this is a clear case of a Heisenbug – and as such, you have to run exactly the load that exposes it and examine the program’s state if and when it happens. Altering the environment in any way is not a guarantee that it will go away (because you don’t know what exact combination of circumstances causes the bug). E.g. you could have just happened to get a fast VM for your debug session.
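One way to capture that state without changing the workload much is to log how long each step takes and what it failed with. Below is a rough sketch; Timing and timed are hypothetical names invented for this example, and it prints to standard output rather than using any particular logging library.

```scala
import scala.util.control.NonFatal

// Hypothetical helper for illustration only.
object Timing {

  /** Runs body, printing a line when it starts, when it finishes,
    * and (with the exception) when it fails. The exception is rethrown. */
  def timed[A](label: String)(body: => A): A = {
    val start = System.nanoTime()
    def elapsedMs: Long = (System.nanoTime() - start) / 1000000L

    println(s"[start] $label")
    try {
      val result = body
      println(s"[done]  $label after $elapsedMs ms")
      result
    } catch {
      case NonFatal(e) =>
        println(s"[fail]  $label after $elapsedMs ms: $e")
        throw e
    }
  }
}
```

Wrapping the suspicious steps, for example Timing.timed("start stub server") { ... } and Timing.timed("first request") { ... }, leaves a trail in the job log showing which step was slow or failed on the run where the bug shows up. Keep in mind that extra logging also changes timing slightly, so it may itself shift a Heisenbug.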