Run "finally" block even if build hangs for more than 10 minutes

itaibh · March 7, 2021, 9:19am

I have a rare hang in the build: no output is written, and the build process is terminated after 10 minutes.
I also collect proxy dumps during that time (I suspect a network issue) and want to be able to use them, but they are only uploaded to Azure in the after_script: part of .travis.yml, which isn’t called in that case.
Is there another part of the .travis.yml I can use that will ALWAYS get called, even in case of build errors? Something like the finally clause of a try/catch/finally?

native-api · March 7, 2021, 10:52am

There’s no such “another part”. That feature is there to terminate hung builds – so it assumes that the build logic has malfunctioned and any further actions or calls to it are thus meaningless. (E.g. you have a guard for this particular step in mind, but not for other steps.)

So instead, add some kind of guard that terminates your command if it runs for too long, or if it doesn’t produce any output for too long (e.g. by piping its output to another process that runs a background timer which resets each time output is seen). This guard process should itself produce some output regularly so that the build isn’t killed.
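For the simpler "runs for too long" case, a minimal sketch could look like the following. It assumes a Linux build image where `timeout` from GNU coreutils is available; `run_with_limit` is a hypothetical helper name, not something Travis provides.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command under a hard wall-clock limit.
# Assumes `timeout` from GNU coreutils is on PATH (not the case on stock macOS).
run_with_limit() {
    local limit=${1:?usage: run_with_limit SECONDS COMMAND [ARGS...]}
    shift
    # Send TERM once the limit is reached, then KILL 10 seconds later
    # if the command is still running.
    timeout --kill-after=10 "$limit" "$@"
    local status=$?
    # timeout exits with 124 when the limit was hit and TERM was enough
    # (137 if it had to fall back to KILL).
    if [[ $status -eq 124 ]]; then
        echo "run_with_limit: '$*' exceeded ${limit}s and was terminated" >&2
    fi
    return "$status"
}
```

Called as, say, `run_with_limit 540 <command>`, the command fails on its own before 10 minutes have passed, so the build should not be terminated for producing no output. A limit on total run time only suits commands that normally finish well within it.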

Alternatively, add tracing to your code (e.g. instead of running a proxy, make the program itself report its traffic) so that you can see at which exact place in the logic it hangs.

itaibh · March 7, 2021, 11:21am

I understand, but what if I can’t? The before_script is there to initialize things for the job(s) itself. The after_script is for cleaning up and collecting all data that wasn’t reported to the log (say, due to its size).
Why not call it anyway? Why is it considered meaningless if the build was hung?

In my specific case I can’t write it to the log, as the log becomes too large and the build would fail. The network dump is a ~500 MB JSON file. Even if I write only the simplest form of the output (1 line per request, 1 line per response), it still might break the build and I wouldn’t be able to gather information from it.

The job itself is quite simple, eventually:

dotnet test -f net5.0 -c Release /path/to/some.csproj

I don’t see how I can guard it as you suggest. Perhaps with travis_wait?

native-api · March 7, 2021, 6:40pm

The script section is designed to run tests: the build continues if a command in script: fails.

Perhaps it makes sense to also make the build continue and consider the command failed if it hangs.

So an easy solution would be to make the command automatically fail if it hangs.

E.g. with this function, based on start_spinner from https://github.com/matthew-brett/multibuild/blob/be06f5f857fa6865701da4980f3e879b10c6b717/common_utils.sh#L40-L54:

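function watchdog() {
    # $1: PID of the command to watch; $2: max seconds without output (default 550)
    WD_PID=${1:?}
    WD_TIMEOUT=${2:-550}

    # Background loop: relay the command's output from stdin, line by line
    (while true; do
        ps -p "$WD_PID" &>/dev/null || break
        read -r -t "${WD_TIMEOUT}" LINE
        # read returns a status greater than 128 only when it times out
        if [[ $? -le 128 ]]; then
            echo "$LINE"
        else
            echo "${FUNCNAME[0]}: No output within ${WD_TIMEOUT}s, killing the command"
            kill -KILL "$WD_PID" || true
            break
        fi
    done) <&0 &
    WD_SPINNER_PID=$!
    # Wait for the command, then stop the relay loop and return the command's exit status
    wait "$WD_PID"; ret=$?; kill "$WD_SPINNER_PID" &>/dev/null || true; wait "$WD_SPINNER_PID"
    return $ret
}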

The command is then run like this (requires Bash 4):

coproc <command> 2>&1; watchdog "$COPROC_PID" <&${COPROC[0]};

itaibh · March 8, 2021, 6:13am

Thank you for taking my questions seriously and actually finding solutions. I haven’t tried that solution yet; once I do, I’ll let you know whether it works for me.

Thanks again.
