I’m getting a “no output” failure for a test that passes locally on Windows 10 (tested on two different Windows machines). Output should be printed to the console at intervals well under one second, so the failure is not caused by the job simply running for more than 10 minutes without printing anything.
I discovered that it has something to do with parallel processing, implemented in these lines of code, as I do not get the failure with parallel processing turned off. FWIW, I’m using multi-threading for parallel processing, and the tests do pass on Linux and macOS in Julia 1.2 and 1.3. Here is my .travis.yml file for reference.
I’ll add that parallel processing for my package works fine in Travis in Julia 1.3 on Windows with different options enabled (i.e. different configurations), and the no output failure only occurs when using parallel processing with another option enabled, conditional = true, which triggers this code and some more control flow. It must be relevant, but I’m not sure how.
Any ideas what might be going on here?
Thanks for any insight or help. Please let me know what additional information or clarification I can add. I know there are some specifics and details likely missing here. I’m just not quite sure where to start.
- Vincent
Note
This is a dupe of this post on the Julia discourse forum. I will cross post links to any answers.
This appears to be a broader issue with Julia 1.3 on Travis CI Windows builds. Other tests are now also failing with no output on Julia 1.3 on Windows. The common denominator is multi-threading, so it is looking more likely that this is a bug in Travis.
Until you diagnose the problem, you cannot say for sure if it’s Travis’ problem or yours.
A multithreaded program hanging after a few loop iterations very much looks like a deadlock. It’s possible that the new language version does multithreading differently.
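One way to confirm a hang, rather than waiting for the CI no-output timeout, is to wrap the threaded call in a watchdog. This is only a minimal sketch, assuming Julia 1.3 or later; `run_with_timeout` and `work` are hypothetical names for illustration and are not part of the package discussed here.

```julia
# Hypothetical helper: run `f` on a spawned task and raise an error if it
# has not finished within `seconds`, instead of hanging silently.
function run_with_timeout(f, seconds)
    task = Threads.@spawn f()
    # timedwait polls the callback and returns :ok once it is true,
    # or :timed_out when the time limit is reached.
    status = timedwait(() -> istaskdone(task), Float64(seconds))
    status === :ok || error("no result after $(seconds) s; possible deadlock")
    return fetch(task)
end

# Stand-in for the real threaded computation (an assumption for illustration).
work() = sum(1:100)

result = run_with_timeout(work, 5)
println(result)
```

Note that the watchdog only reports the hang; the stuck task keeps running. It also relies on the task that calls `timedwait` still being scheduled, so it may not fire if the hang blocks that thread as well.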
Any reason you know of that it would be passing tests on Windows in Julia 1.3 on local machines but not in Travis? That was the only reason I was thinking it might be something on the Travis side.
I’m not sure where to begin to try to diagnose because I can’t reproduce the problem locally.
I think a deadlock must be the problem, since I’m seeing the same behavior on AppVeyor, with tests just hanging. Perhaps the architecture of the VMs is why it fails only on Travis/AppVeyor and not locally. I’ll try to approach it from that angle.
I’ll post back here and mark your suggestion that it may be a deadlock as the solution if I end up finding that to be the problem for sure. Thanks for your help.
Your local machine probably has more than 2 cores, and far more of other resources, than a build VM. On the VM, threads compete more and things happen in a very different order, which creates more and different opportunities for conflicts.
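To compare the two environments, it may help to print the threading setup at the top of the test script, so the CI log and a local run can be read side by side. A small sketch using only Base functionality:

```julia
# Record how many Julia threads and logical CPU cores this run sees.
# Both values are printed so a CI log can be compared with a local run.
n_threads = Threads.nthreads()
n_cpus = Sys.CPU_THREADS

println("Julia threads: ", n_threads)
println("Logical CPU cores: ", n_cpus)
```

The thread count is fixed at startup by the `JULIA_NUM_THREADS` environment variable, so setting it locally to the value seen in CI is one way to get closer to the CI conditions. That does not limit the number of physical cores, so it is only an approximation of the build VM.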
On the bright side – you’ve discovered a deadlock in your program, before it got into production! Great, that’s exactly what CI is for!