DefaultCacheStore: misleading error + no escape hatch when launch dir is on NFS/Lustre/GPFS #6996

@matthdsm

Description

Bug report

When Nextflow is launched from a directory on a network filesystem that does not support POSIX file locking (NFS, Lustre, GPFS, BeeGFS, …), opening the LevelDB cache database fails with:

ERROR ~ Can't open cache DB: /path/to/launchDir/.nextflow/cache/<uuid>/db

Nextflow needs to be executed in a shared file system that supports file locks.
Alternatively, you can run it in a local directory and specify the shared work
directory by using the "-w" command line option.

This error manifests in `DefaultCacheStore.openDb()` (`DefaultCacheStore.groovy`) when `Iq80DBFactory.open()` throws an exception that is not the known `"Unable to acquire lock"` variant.

Root Cause

`DefaultCacheStore` always places the LevelDB cache directory under `.nextflow/cache/` relative to the pipeline launch directory (`Const.appCacheDir`). On HPC clusters, the launch directory is typically on a shared filesystem (NFS, Lustre, etc.) that does not support the file locking mechanisms LevelDB requires.

Three concrete problems exist today:

1. The underlying exception is swallowed silently

The caught `Exception e` is attached as a cause to the thrown `IOException` but is never logged. Developers and users looking at `.nextflow.log` cannot see what LevelDB actually reported, making diagnosis unnecessarily difficult.

// current code – e is never logged
throw new IOException(msg, e)

2. No escape hatch when the launch directory cannot be moved

The only documented workaround is to run Nextflow from a local directory and use `-w` to redirect the work directory. However, `-w` only controls where task work directories are created — it does not move the `.nextflow/cache/` directory. Users who must launch from a shared path (e.g. a project directory enforced by HPC policies) have no supported way to redirect just the cache.

3. The error message is misleading

The message tells users to use `-w`, which does not actually solve the underlying problem (the cache DB is still on the shared filesystem). Users following this advice will hit the same error.

Proposed Fixes

Fix 1 – Log the root cause (diagnostic)

Log the caught exception at `DEBUG` level in the `else` branch so it always appears in `.nextflow.log`:

log.debug "Failed to open LevelDB cache at path: $file -- cause: ${e.message}", e

Fix 2 – Add `NXF_CACHE_DIR` env-var support

Introduce a `resolveCacheBaseDir()` helper that checks the `NXF_CACHE_DIR` environment variable. When set, the variable overrides the default `.nextflow/` path, allowing users to redirect only the cache DB to a lock-capable local filesystem without moving the launch directory:

export NXF_CACHE_DIR=/tmp/nxf-cache-$USER
nextflow run pipeline.nf -w /projects/exome/work
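The resolution logic could look roughly like the following. This is a Java-flavored sketch for illustration only; the actual file is Groovy, the method name `resolveCacheBaseDir()` comes from the proposal above, and everything else (class name, parameters) is an assumption, not the reference implementation:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class CacheDirResolver {

    // Hypothetical helper: NXF_CACHE_DIR, when set and non-empty, overrides
    // the default `.nextflow/` base directory under the launch directory.
    static Path resolveCacheBaseDir(String nxfCacheDir, Path launchDir) {
        if (nxfCacheDir != null && !nxfCacheDir.isEmpty())
            return Paths.get(nxfCacheDir);
        return launchDir.resolve(".nextflow");
    }

    public static void main(String[] args) {
        Path launch = Paths.get("/projects/exome");
        // Default: cache base stays under the launch directory
        System.out.println(resolveCacheBaseDir(null, launch));
        // Override: cache base redirected to a local, lock-capable path
        System.out.println(resolveCacheBaseDir("/tmp/nxf-cache", launch));
    }
}
```

The cache DB path (`cache/<uuid>/db`) would then be resolved against this base instead of the hard-coded launch-dir location.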

Fix 3 – Rewrite the error message

Replace the current misleading message with one that:

  • accurately names the problem (cache DB on a lock-incapable filesystem)
  • lists both remedies: run from a local dir with `-w`, or set `NXF_CACHE_DIR`
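One possible wording, assembled from the two points above (illustrative only, not taken from the reference branch):

```
ERROR ~ Can't open cache DB: /path/to/launchDir/.nextflow/cache/<uuid>/db

The cache database is on a filesystem that does not support the file locks
required by LevelDB (common on NFS, Lustre, GPFS and similar shared
filesystems). Either launch Nextflow from a local directory and point the
work directory at the shared storage with "-w", or set the NXF_CACHE_DIR
environment variable to a lock-capable local path to relocate only the
cache database.
```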

Affected file

`modules/nextflow/src/main/groovy/nextflow/cache/DefaultCacheStore.groovy`

Environment

Reproducible whenever the pipeline is launched from NFS / Lustre / GPFS. Common in HPC environments where project directories live on a shared parallel filesystem.

Proposed implementation

A reference implementation of all three fixes is available at:
https://github.com/matthdsm/nextflow/tree/fix/cache-db-nfs-error
