Debugging timeouts: make tests verbose in CI #31087
Conversation
Timeouts never come along when you want them!
Here we have a timeout for one test in particular. That test is relatively new, so it could fit with when we started to see CI timeouts.
Very aware that I may be clutching at straws / on a wild goose chase / insert idiom of choice here.
Is there a way we could trigger verbose output via commit message? Basically I know this is a draft, but would it be useful to have a permanent flavor of this PR?
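For illustration, a toggle like that could look roughly like the sketch below in a GitHub Actions workflow; the `[verbose-ci]` marker, step names, and pytest arguments are hypothetical and not part of this PR.

```yaml
# Hypothetical sketch of a commit-message toggle (not part of this PR).
# Note: github.event.head_commit.message is only populated on push events,
# so a workflow triggered by pull_request would need another message source.
- name: Run tests (verbose when requested)
  if: contains(github.event.head_commit.message, '[verbose-ci]')
  run: python -m pytest -v -ra -n auto

- name: Run tests (default verbosity)
  if: ${{ !contains(github.event.head_commit.message, '[verbose-ci]') }}
  run: python -m pytest -ra -n auto
```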
At this point I don't even know if any of this is useful! 🫣
What's the purpose? Either we want to generally collect information, in which case we just switch on verbose on main. Or we only want to do targeted experiments, in which case a branch in a PR is sufficient.
For the reason I think Ruth opened this PR: something is going wonky, it seems CI-specific, and it'd be helpful to have the verbose output here specifically for debugging that one thing.
Yes, but what do you need the "trigger from commit message" for? We can configure this once and commit it. Either here if we want to limit changes to a specific experimental environment, or on main if we want to generally collect data.
I don't know what you mean by this.
I'm guessing we don't want this all the time, only when a PR is breaking in ways where the short messages are unhelpful.
Ah, I think I understand our mutual misunderstanding: I'm focussed on the flaky timeout issue (where toggling verbosity does not help). You are discussing whether toggling verbosity would be a generally desirable debugging tool. Let's move that discussion out and focus here on the timeouts. You are welcome to open an issue for the general solution if you are interested.
@rcomer I believe your ideas and approach are valuable. Obviously, removing that single test didn't cut it. I noticed here and in the run before that, before the timeout, many (>10) tests from the other worker completed. So it's likely not a single long-running blocking task in the other worker. Ideas for further investigation:
Running in one worker: the first attempt gave no timeouts, but there is pretty big variation in how long tests take on Azure py312 and py313, and on MacOS 14 with both Python versions. So maybe the concurrency is a red herring. Trying again to see what turns up...
This time MacOS 15 instead of MacOS 14 has the long-running test. Edit: just realised this test is anyway xfailed, so maybe not one to focus on.

matplotlib/lib/matplotlib/tests/test_backends_interactive.py, lines 789 to 795 in d68c7e3
MacOS 14 and 15 runners have less capacity (CPU and memory) than any of the other runners. In GitHub Actions we set the number of pytest workers to "auto". In this recent PR, pytest chose 3 workers for MacOS and only 2 for Ubuntu. That seems the wrong way around! Maybe we should just fix it at 2?
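If we did pin it, the change would be small; a sketch of the idea (the real workflow step and pytest arguments in this repo differ):

```yaml
# Sketch only: the actual step and arguments in the matplotlib workflows differ.
- name: Run pytest
  run: |
    # "-n auto" lets pytest-xdist pick a worker count from the runner's CPU count,
    # which is how the small MacOS runners ended up with 3 workers; "-n 2" pins it.
    python -m pytest -ra -n 2 --timeout=300
```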
On Ubuntu-arm at the top of this PR, pytest chose 4 workers. There was 1 timeout and 1 UnraisableException.
Well that gave me a failure, but it wasn't a timeout. It does have something to do with subprocesses though. The successful runs completed in 16-19 minutes, which I think is pretty consistent with what we get in general. So I don't think fixing at two runners will lose us anything. Re-spinning...
Gah! Webagg timeout.
Possibly related, but when I run locally in parallel, then there are also some knock-on effects of running this test with Qt, i.e., #31049; perhaps something is crashing, but not correctly raising due to it?
Regardless of the time issue, perhaps increasing the density of subprocess-calling tests might trigger more clues.
Huh. Azure only runs 28 tests this way with 165 skips.
In a sample of 5 tries, I don't get any timeouts on Azure when limiting to the subprocess tests.
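The exact selection used for those Azure runs isn't shown in this thread; one way a subprocess-focused run can be expressed with pytest is a `-k` name filter, sketched below with an illustrative expression.

```yaml
# Illustrative only: the -k expression is a guess at the kind of filter used,
# not the actual one from the Azure runs discussed above.
- name: Run subprocess-heavy tests only
  run: |
    python -m pytest -v -ra -n 2 \
      -k "subprocess or webagg or backends_interactive" \
      lib/matplotlib/tests
```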
With the current setup, let's see if it comes up a third time.
This PR is not for merging, just using the CI.
Made pytest verbose in the hope of generating clues for diagnosing the subprocess timeout problem (#30851). Removed the standard Ubuntu tests because I don't think we ever see timeouts there.
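In practice that amounts to adding pytest's verbosity and reporting flags to the CI invocations; a sketch of the kind of command line involved (not the literal diff in this PR):

```yaml
# Sketch of the sort of invocation this PR moves towards (not the literal diff):
# "-v" prints each test name and outcome as it runs, and "-ra" summarises
# skips/xfails/errors at the end, which helps narrow down where a hang occurs.
- name: Run pytest (verbose, for timeout debugging)
  run: python -m pytest -v -ra -n 2 --timeout=300 lib/matplotlib/tests
```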