Skip to content

Conversation

@oliver-sanders
Copy link
Member

Pre-release testing for 8.5.0 flagged several test issues.

  • Some tests haven't been run for a while (PBS testing was very difficult for us on the old system, recently retired).
  • Some tests needed modification to run on the new system.

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • Tests are included (or explain why tests are not needed).
  • Changelog entry included if this is a change that can affect users
  • Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
  • If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

@oliver-sanders oliver-sanders added this to the 8.5.x milestone Jul 25, 2025
@oliver-sanders oliver-sanders self-assigned this Jul 25, 2025
Comment on lines +108 to +109
if cylc.flow.flags.verbosity > 1:
print(f'$ ln -s "{target}" "{path}"')
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make symlink dir issues easier to debug.

Comment on lines 28 to 36
[[_retrieve]]
$(cylc config -i "[platforms][$CYLC_TEST_PLATFORM]")
[[_retrieve]]
retrieve job logs = True
install target = $CYLC_TEST_PLATFORM
[[_no_retrieve]]
$(cylc config -i "[platforms][$CYLC_TEST_PLATFORM]")
[[_no_retrieve]]
retrieve job logs = False
install target = $CYLC_TEST_PLATFORM
"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Allow default [directives] to be specified in the test config.
  • install target should already be set correctly.

Comment on lines 52 to 53
sed -i -E 's/--max-size=[^ ]* //' 'my-rsync.log.edited' # strip "retrieve job logs max size" arg
sort -u 'my-rsync.log.edited' # stip out duplicates (can result from PBS log file spooling)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Strip --max-size options which may or may not be added based on config.
  • Tolerate job log retrieval retries (needed when PBS output spooling is in play).

Comment on lines 36 to -38
[[goodhostplatform]]
hosts = ${CYLC_TEST_HOST}
install target = ${CYLC_TEST_INSTALL_TARGET}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't define platforms like this because they don't inherit any config from $CYLC_TEST_PLATFORM.

Comment on lines -68 to -70
named_grep_ok "job kill retries & succeeds" \
"\[jobs-kill out\] \[TASK JOB SUMMARY\].*1/mixedhosttask/01" \
"${LOGFILE}"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test turned out to be fragile. Replaced with more targetted checks.

Comment on lines +27 to +31
[[goodhosttask]]
script = sleep 60
platform = goodhostplatform

[[mixedhosttask]]
script = sleep 60
platform = mixedhostplatform
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ensure both tasks are running when the kill command is sent.

share/cycle = \$TMPDIR/\$USER/cylctb_tmp_share_dir
work = \$TMPDIR/\$USER
[[[$CYLC_TEST_INSTALL_TARGET]]]
run = \$TMPDIR/\$USER/test_cylc_symlink/ctb_tmp_run_dir
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$TMPDIR/$USER does not exist on our new machine and $TMPDIR path changes from node to node :/

Switched to a permanent directory $HOME/cylctb-symlinks with one subdir for each test.

[[symlink dirs]]
[[[${CYLC_TEST_INSTALL_TARGET}]]]
run = \$TMPDIR/\$USER/sym-run
run = \$HOME/cylctb-symlinks/$TEST_NAME/
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto

@oliver-sanders oliver-sanders marked this pull request as draft July 25, 2025 12:08
Comment on lines +27 to +34
cylc message -- 'echo done'
# wait up to PT1M for the cat-log task to succeed
# (the workflow will shut down if cat-log fails)
sleep 60
# fail if the task was not orphaned by this point
false
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test was supposed to be testing cat-log against running tasks:

Test "cylc cat-log" of currently-running local and remote jobs.

However, there was nothing to keep the task running, so it was actually testing succeeded jobs.

I've changed the approach, the tasks now sleep for 60s, then fail.

If the cat-log works, it will cylc set the task, orphaning the sleep and triggering fin which is used to diagnose test success.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went back to an early version of this test, and it used to have sleep 60 there. Maybe got borked during migration to Cylc 8.

Copy link
Member Author

@oliver-sanders oliver-sanders Oct 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Matt did purposefully try to pull out sleeps from the tests in order to speed them up reduce flakyness. It might have vanished then.

In this particular case, the sleep is ok, but it could have easily been mistaken.

@MetRonnie MetRonnie changed the base branch from master to 8.5.x July 29, 2025 11:47
@oliver-sanders oliver-sanders changed the base branch from 8.5.x to master September 30, 2025 14:58
local OPTS="$*"
local TEST_NAME
TEST_NAME="grep-ok: ${NAME}"
TEST_NAME="$(basename "${FILE}")-grep-ok"
Copy link
Member Author

@oliver-sanders oliver-sanders Sep 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revert an erroneous change made in 15ac2fb which broke test reporting.

@oliver-sanders
Copy link
Member Author

All tests now passing on _remote_background_indep_pbs

@oliver-sanders oliver-sanders modified the milestones: 8.5.x, 8.6.0 Sep 30, 2025
@hjoliver
Copy link
Member

hjoliver commented Oct 1, 2025

(I had a quick run through this; it generally LGTM, mysterious polling result notwithstanding).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should have made this _remote_background_indep_* only. The consequences of using the _remote_pbs_indep_tcp don't make sense.

# Test job kill will retry on a different host if there is a connection failure

export REQUIRE_PLATFORM='loc:remote fs:indep comms:tcp'
export REQUIRE_PLATFORM='loc:remote fs:indep comms:tcp runner:background'
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test relies on mocking background submission (my_background) so can only be run with background platforms.

Comment on lines +59 to +60
# give a bit of grace time for the job to leave the job runner queue
sleep 5
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unavoidable sleep :(

Give PBS a grace period to remove the finished job from the queue before attempting to poll.

This failed reliably yesterday, but passed flakily today. Looks like the required grace period is fairly small.

@oliver-sanders
Copy link
Member Author

Have added a bunch of fixes for tests which were using spaces, colons, slashes, etc in test IDs. These IDs are used as filename prefixes.

* Cylc platforms defined in tests must inherit from the test platform
  config.
* Reference tests only capture triggering, not task outcome (add a
  downstream task if you need to check outputs).
* Fix some timing issues.
* Use the platform configured SSH command for any add-hoc SSH'es.
* Remove `TMPDIR` dependence (ain't always there!).
* Test requires the background job runner.
* Test names cannot contain spaces.
* Revert an erroneous change made in 15ac2fb which broke test reporting.
* Tasks needed to wait for their started messages.
* Give the job runner a few seconds grace time for the job to exit the
  queue.
* Document the test.
* Test works by overriding the background job runner so is not
  compatible with other job runners.
* Test is only compatible with background platforms.
* Test IDs should be prefixed with "$TEST_NAME_BASE".
* Test IDs should be valid for use as file names (no ":", "/", " ", etc).
* `named_grep_ok` was broken by default :(
@oliver-sanders oliver-sanders marked this pull request as ready for review October 1, 2025 15:03
@oliver-sanders oliver-sanders added this to the 8.6.x milestone Oct 1, 2025
@oliver-sanders
Copy link
Member Author

All tests now passing on:

  • _local_slurm_indep_tcp
  • _remote_background_indep_tcp
  • _remote_pbs_indep_tcp

(and the other remote platforms covered by CI)

@oliver-sanders
Copy link
Member Author

@hjoliver, @wxtim, gentle poke, would really like to have the tests passing on master.

@wxtim
Copy link
Member

wxtim commented Oct 15, 2025

Over 8 runs against _remote_pbs_indep_tcp I have failures

  • 1 tests/functional/events/17-task-event-job-logs-retrieve-command.t
  • 1 tests/functional/events/11-cycle-task-event-job-logs-retrieve.t
  • 3 tests/functional/remote/08-symlink-dir-target-exist.t
  • 5 tests/functional/cylc-cat-log/09-cat-running.t
  • 1 tests/functional/remote/04-symlink-dirs.t

@oliver-sanders
Copy link
Member Author

All pass for me. Have you synced your branch to the test platform?

@wxtim
Copy link
Member

wxtim commented Oct 16, 2025

Now have all tests passing. Some of these seem a little sensitive to global config settings.

@oliver-sanders
Copy link
Member Author

(for info, the failure(s) Tim was seeing were related to timing issues around PBS job log spooling, we have to configure log retrieval retries to make this work)

@oliver-sanders
Copy link
Member Author

Ping @hjoliver

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants