I have a rare hang in the build: no output is written, and the build process is terminated after 10 minutes.
I also have proxy dumps I collect during that time (I suspect a network issue) which I want to be able to use, but they are only uploaded to Azure in the after_script: part of .travis.yml, which in that case isn’t getting called.
Is there another part of the .travis.yml I can use that will ALWAYS get called, even in case of build errors? Something like the finally clause of a try/catch/finally?
There’s no such “another part”. That feature is there to terminate hung builds – so it assumes that the build logic has malfunctioned and any further actions or calls to it are thus meaningless. (E.g. you have a guard for this particular step in mind, but not for other steps.)
So instead, add some kind of guard that terminates your command if it runs for too long, or if it doesn’t produce any output for too long (e.g. by piping its output to another process that runs a background timer which resets each time output is seen). If the command can legitimately stay silent for a long time, this guard process should itself produce some output regularly so that the build isn’t killed.
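A minimal sketch of the first kind of guard (a hard limit on total run time), assuming GNU coreutils `timeout` is available on the build image. The function name `run_with_limit` is a hypothetical helper, not something Travis provides. It only caps the total run time and produces no periodic output of its own, so the limit has to stay below the 10-minute no-output window.

```shell
# Hypothetical helper: run a command with a hard time limit.
# Usage: run_with_limit <seconds> <command> [args...]
run_with_limit() {
    local limit=${1:?usage: run_with_limit <seconds> <command> [args...]}
    shift
    # Send SIGTERM after $limit seconds, then SIGKILL 10 seconds later
    # if the command ignores it. GNU timeout exits with status 124 when
    # the limit is hit (137 if SIGKILL was needed).
    timeout --kill-after=10 "$limit" "$@"
}
```

It would be called as, for instance, `run_with_limit 540 <command>`; the 540 seconds is an illustrative value.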
Alternatively, add tracing to your code (e.g. instead of running a proxy, make the program itself report its traffic) so that you can see at which exact place in the logic it hangs.
I understand, but what if I can’t? The before_script is there to initialize things for the job(s) itself. The after_script is for cleaning up and collecting all data that wasn’t reported to the log (say, due to its size).
Why not call it anyway? Why is it considered meaningless if the build was hung?
In my specific case I can’t write it to the log, as the log becomes too large and the build would fail. The network dump is a ~500 MB JSON file. Even if I write only the simplest form of the output (1 line per request, 1 line per response), it still might break the build and I wouldn’t be able to gather information from it.
The job itself is quite simple, eventually:
dotnet test -f net5.0 -c Release /path/to/some.csproj
I don’t see how I can guard it as you suggest. Perhaps with travis_wait?
The script section is designed to run tests: the build continues if a command in script: fails.
Perhaps it makes sense to also make the build continue and consider the command failed if it hangs.
So an easy solution would be to make the command automatically fail if it hangs.
E.g. with this function, written based on start_spinner from https://github.com/matthew-brett/multibuild/blob/be06f5f857fa6865701da4980f3e879b10c6b717/common_utils.sh#L40-L54:
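function watchdog() {
    # $1: PID of the command to watch
    # $2: maximum number of seconds without output (default: 550)
    local WD_PID=${1:?}
    local WD_TIMEOUT=${2:-550}
    local WD_SPINNER_PID ret LINE
    # Background loop: relay the command's output (read from stdin) line by line
    # and kill the command if no line arrives within the timeout.
    (while true; do
        ps -p "$WD_PID" &>/dev/null || break
        IFS= read -r -t "${WD_TIMEOUT}" LINE
        # read returns a status greater than 128 only when it times out
        if [[ $? -le 128 ]]; then
            echo "$LINE"
        else
            echo "${FUNCNAME[0]}: No output within ${WD_TIMEOUT}s, killing the command"
            kill -KILL "$WD_PID" || true
            break
        fi
    done) <&0 &
    WD_SPINNER_PID=$!
    # Wait for the command, then stop the relay loop and return the command's status
    wait "$WD_PID"; ret=$?
    kill "$WD_SPINNER_PID" &>/dev/null || true
    wait "$WD_SPINNER_PID"
    return $ret
}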
The command is then run like (requires Bash 4):
coproc <command> 2>&1; watchdog "$COPROC_PID" <&${COPROC[0]};
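Once the command is guaranteed to return, the dump upload no longer has to live in after_script: it can run right after the tests in the same script step. A rough sketch of that "finally"-style pattern follows. `run_then_collect` is a hypothetical helper, and the collection step passed to it stands in for whatever currently uploads the dumps to Azure.

```shell
# Hypothetical helper: run a command, then always run a collection step,
# and report the original command's exit status.
# Usage: run_then_collect <collect-function-or-command> <command> [args...]
run_then_collect() {
    local collect=${1:?usage: run_then_collect <collect> <command> [args...]}
    local ret=0
    shift
    # Remember the command's exit status instead of aborting on failure.
    "$@" || ret=$?
    # Runs whether the command succeeded, failed, or was killed by a guard.
    "$collect" || echo "run_then_collect: collection step failed" >&2
    return "$ret"
}
```

The guarded command can itself be a shell function that wraps the coproc and watchdog pair, so a hang turns into an ordinary failure and the collection step still runs.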
Thank you for taking my questions seriously and actually finding solutions. I haven’t tried that solution yet; once I do, I’ll let you know whether it works for me.
Thanks again.