
[BUG] Unit tests time out randomly on ubuntu-latest #673

Open
MBartkowiakSTFC opened this issue Feb 18, 2025 · 4 comments
Labels
bug (Something isn't working) · CI/CD (Something concerning the CI/CD pipeline)

Comments

@MBartkowiakSTFC
Collaborator

Description of the error
Our unit test workflow seems to get stuck on the ubuntu-latest runner. We run four jobs on ubuntu-latest, one for each supported Python version. Sometimes one of the four (or, rarely, two of the four: https://github.com/ISISNeutronMuon/MDANSE/actions/runs/13350554804) ends up marked as failed, with the error:
The job running on runner GitHub Actions X has exceeded the maximum execution time of 360 minutes.

Describe the expected result
If the tests all pass on the other platforms, they should also pass on Linux.

Describe the actual result
Usually, the tests all complete without problems. Sometimes, however, the unit tests stop running after the first file. The output is:
Analysis/test_average_structure.py ................ [ 2%]
and the next test file, Analysis/test_dynamics.py, either never starts or never completes.

In the end the entire workflow times out.

Suggested fix
Since the failure is not reproducible, and seems to affect only one platform, it is difficult to tell whether the error is on our side or a problem with the runner. However, when the tests fail, they always seem to fail at the same point in the workflow, so maybe the unit tests need to be changed somehow.

From a different point of view, we could also lower the time limit for the workflows so that they fail sooner. We don't normally need more than 20 minutes, so a limit of 30 minutes would be more than enough.
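A minimal sketch of that timeout change, assuming the unit tests run from a workflow file such as .github/workflows/unit_tests.yml (hypothetical name; the real file and job names may differ). The job-level timeout-minutes key overrides the 360-minute default:

# Hypothetical excerpt from the unit-test workflow; only the relevant keys are shown.
jobs:
  unit-tests:
    runs-on: ubuntu-latest
    # Cancel the job after 30 minutes instead of the 360-minute default,
    # so a hung test run is reported quickly.
    timeout-minutes: 30
    # (matrix, steps, etc. unchanged)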

Additional details
N/A

MBartkowiakSTFC added the bug and CI/CD labels on Feb 18, 2025
@MBartkowiakSTFC
Collaborator Author

It seems that the tests always get stuck on the Ubuntu runner, but not always on the same test after all. Last failure:

Run cd MDANSE/Tests/UnitTests
============================= test session starts ==============================
platform linux -- Python 3.12.9, pytest-8.3.4, pluggy-1.5.0
rootdir: /home/runner/work/MDANSE/MDANSE/MDANSE
configfile: pyproject.toml
collected 702 items

Analysis/test_average_structure.py ................                      [  2%]
Analysis/test_dynamics.py .............................................. [  8%]
Error: The operation was canceled.

@MBartkowiakSTFC
Collaborator Author

And another one, also on Ubuntu:

Run cd MDANSE/Tests/UnitTests
============================= test session starts ==============================
platform linux -- Python 3.12.9, pytest-8.3.4, pluggy-1.5.0
rootdir: /home/runner/work/MDANSE/MDANSE/MDANSE
configfile: pyproject.toml
collected 702 items

Analysis/test_average_structure.py ................                      [  2%]
Analysis/test_dynamics.py .............................................. [  8%]
..............................................                           [ 15%]
Analysis/test_infrared.py ..                                             [ 15%]
Analysis/test_mcstas.py x                                                [ 15%]
Analysis/test_mdmc_h5md.py ..                                            [ 16%]
Analysis/test_meansquare.py ..                                           [ 16%]
Analysis/test_mock_dynamics.py ......                                    [ 17%]
Analysis/test_molecule_names.py ..                                       [ 17%]
Analysis/test_qvectors.py ..                                             [ 17%]
Analysis/test_resolutions.py .......                                     [ 18%]
Analysis/test_scattering.py ..................                           [ 21%]
Error: The operation was canceled.

@oerc0122
Collaborator

Suggest we try adding a global timeout to the tests, so we might actually get some debug information rather than GitHub killing the runner?
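A minimal sketch of what that could look like, assuming the pytest-timeout plugin (the "Failed: Timeout >240.0s" in the traceback further down suggests it is already available in at least some jobs) and a hypothetical conftest.py in MDANSE/Tests/UnitTests:

# conftest.py (sketch, not existing MDANSE code): apply a global per-test timeout
# so a hung test fails with a traceback instead of the whole job being cancelled.
import pytest

GLOBAL_TIMEOUT_SECONDS = 300  # hypothetical value, well above a normal test's runtime


def pytest_collection_modifyitems(config, items):
    for item in items:
        # Leave tests that already carry an explicit timeout marker alone.
        if item.get_closest_marker("timeout") is None:
            item.add_marker(pytest.mark.timeout(GLOBAL_TIMEOUT_SECONDS))

The same effect is available without any code, for example timeout = 300 under [tool.pytest.ini_options] in pyproject.toml, or --timeout=300 on the pytest command line.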

@MBartkowiakSTFC
Collaborator Author

We just had another test that got stuck: https://github.com/ISISNeutronMuon/MDANSE/actions/runs/13585226327/job/37978614629

It was a parallel job. The most relevant part of the error output is:

parameters = {'atom_selection': '{}', 'atom_transmutation': '{}', 'frames': 'all', 'grouping_level': 'atom', ...}
traj_info = ('short_traj', '/home/runner/work/MDANSE/MDANSE/MDANSE/Tests/UnitTests/Analysis/../Converted/short_trajectory_after_changes.mdt')
job_info = ('MeanSquareDisplacement', ['msd'], False)
running_mode = ('multicore', -4), output_format = 'MDAFormat'

    @pytest.mark.parametrize("traj_info,job_info,running_mode,output_format", total_list)
    def test_dynamics_analysis(
        parameters, traj_info, job_info, running_mode, output_format
    ):
        temp_name = tempfile.mktemp()
        parameters["trajectory"] = traj_info[1]
        parameters["running_mode"] = running_mode
        parameters["output_files"] = (temp_name, (output_format,), "INFO")
        job = IJob.create(job_info[0])
>       job.run(parameters, status=True)

Analysis/test_dynamics.py:189: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/site-packages/MDANSE/Framework/Jobs/IJob.py:423: in run
    IJob._runner[mode](self)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <MDANSE.Framework.Jobs.MeanSquareDisplacement.MeanSquareDisplacement object at 0x7fe9b50782e0>

    def _run_multicore(self):
        if hasattr(self._status, "_queue_0"):
            self._status._queue_0.put("started")
    
        inputQueue = self.inputQueue
        outputQueue = self.outputQueue
        log_queue = self.log_queue
    
        log_queues = [log_queue]
        handlers = []  # handlers that are not QueueHandlers
        for handler in LOG.handlers:
            if isinstance(handler, QueueHandler):
                log_queues.append(handler.queue)
            else:
                handlers.append(handler)
    
        listener = QueueListener(log_queue, *handlers, respect_handler_level=True)
        listener.start()
    
        self._processes = []
    
        for i in range(self.numberOfSteps):
            inputQueue.put(i)
    
        for i in range(self.configuration["running_mode"]["slots"]):
            self._run_multicore_check_terminate(listener)
            p = multiprocessing.Process(
                target=self.process_tasks_queue,
                args=(inputQueue, outputQueue, log_queues),
            )
            self._processes.append(p)
            p.daemon = False
            p.start()
    
        n_results = 0
        while n_results != self.numberOfSteps:
            self._run_multicore_check_terminate(listener)
            if self._status is not None:
                self._status.fixed_status(n_results)
            try:
                index, result = outputQueue.get_nowait()
            except queue.Empty:
>               time.sleep(0.1)
E               Failed: Timeout >240.0s
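
One possible reading of that loop, as a minimal, hypothetical sketch (not MDANSE code): if a worker process dies without putting its result on the output queue, n_results never reaches the expected count, and the polling loop spins until an external timeout (here pytest-timeout after 240 s) cancels it.

# hang_sketch.py: hypothetical reproduction of the failure mode described above.
import multiprocessing
import os
import queue
import time


def worker(input_queue, output_queue):
    while True:
        try:
            index = input_queue.get(timeout=1.0)
        except queue.Empty:
            return
        if index == 3:
            os._exit(1)  # simulate a worker crashing mid-task (segfault, OOM kill, ...)
        output_queue.put((index, index * index))


if __name__ == "__main__":
    number_of_steps = 8
    input_queue = multiprocessing.Queue()
    output_queue = multiprocessing.Queue()
    for i in range(number_of_steps):
        input_queue.put(i)

    processes = [
        multiprocessing.Process(target=worker, args=(input_queue, output_queue))
        for _ in range(2)
    ]
    for p in processes:
        p.start()

    # Same shape as the loop in _run_multicore: poll for results until every
    # step is accounted for. One result never arrives, so without the deadline
    # below this loop would spin forever.
    n_results = 0
    deadline = time.monotonic() + 10.0
    while n_results != number_of_steps:
        if time.monotonic() > deadline:
            print(f"Hung: only {n_results}/{number_of_steps} results arrived")
            break
        try:
            index, result = output_queue.get_nowait()
            n_results += 1
        except queue.Empty:
            time.sleep(0.1)

    for p in processes:
        p.terminate()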
