nf-upload2irods

Overview

This Nextflow pipeline uploads files and directories to iRODS storage with comprehensive metadata management. The pipeline supports three main operations: file upload with automatic checksum verification, metadata attachment to existing iRODS collections, and metadata retrieval from iRODS collections.

Contents of Repo:

main.nf — the main Nextflow pipeline that orchestrates file uploads and metadata management with iRODS
nextflow.config — configuration script for IBM LSF submission on Sanger's HPC with Singularity containers and global parameters
configs/ — configuration files for individual pipeline modules
modules/local/irods/storefile/ — module for uploading files to iRODS with checksum verification
modules/local/irods/attachmetadata/ — module for attaching metadata to iRODS collections
modules/local/irods/getmetadata/ — module for retrieving metadata from iRODS collections
modules/local/irods/aggregatemetadata/ — module for aggregating retrieved metadata
modules/local/csv/concat/ — module for concatenating CSV files
examples/ — example input files for different pipeline operations
tests/ — test data and configurations for pipeline validation

Pipeline Workflow

File Discovery: Reads file/directory information from CSV input file
Path Classification: Distinguishes between individual files and directories for processing
File Collection: For directories, recursively gathers all files within the directory structure
File Filtering: Applies ignore patterns to exclude specified file types from upload
iRODS Upload: Transfers files to iRODS with MD5 checksum verification
Metadata Attachment: Attaches custom metadata to iRODS collections (separate operation)
Metadata Retrieval: Retrieves metadata from existing iRODS collections (separate operation)

Pipeline Parameters

Required Parameters (choose one):

--upload — Path to a CSV file containing upload information with columns: path (local filesystem path) and irodspath (target iRODS path) OR
--attach_metadata — Path to a CSV or JSON file containing metadata information with columns: irodspath (target iRODS path) and additional metadata key-value pairs OR
--get_metadata — Path to a CSV file containing iRODS paths with column: irodspath (iRODS path to retrieve metadata from)

Optional Parameters:

--output_dir — Output directory for pipeline results (default: "results")
--publish_mode — File publishing mode (default: "copy")
--ignore_ext — Comma-separated list of file extensions to ignore during upload (default: null)
--remove_existing_metadata — Remove existing metadata before adding new metadata (default: false)
--dup_meta_separator — Separator for splitting multiple values in metadata fields (default: ";")
--metadata_index_name — Column name for iRODS path in metadata files (default: "irodspath")
--join — Join method for metadata operations (default: "outer")
--verbose — Enable verbose output for detailed logging (default: false)

Input File Formats

The pipeline supports three distinct operation modes:

Option 1: File Upload (`--upload`)

CSV file with the following structure:

path,irodspath
/path/to/local/file.txt,/archive/cellgeni/target/file.txt
/path/to/local/directory,/archive/cellgeni/target/directory
/path/to/another/file.csv,/archive/cellgeni/target/data.csv

Where:

path: Absolute path to the local file or directory to upload
irodspath: Target path in iRODS where the file/directory should be stored

Option 2: Metadata Attachment (`--attach_metadata`)

CSV file with the following structure:

irodspath,meta1,meta2,meta3
/archive/cellgeni/target/collection1,value1,value2,value3
/archive/cellgeni/target/collection2,value4,value5,value6
/archive/cellgeni/target/collection3,value7,value8,value9

JSON file with the following structure:

[
  {
    "irodspath": "/archive/cellgeni/target/collection1",
    "meta1": "value1",
    "meta2": "value2",
    "meta3": "value3"
  },
  {
    "irodspath": "/archive/cellgeni/target/collection2",
    "meta1": "value4",
    "meta2": "value5",
    "meta3": "value6"
  }
]

Where:

irodspath: Target iRODS collection path for metadata attachment
Additional fields: Custom metadata key-value pairs to attach to the collection
Values can contain multiple entries separated by the --dup_meta_separator (default: ";") which will create separate metadata entries for each value

Option 3: Metadata Retrieval (`--get_metadata`)

CSV file with the following structure:

irodspath
/archive/cellgeni/collection1
/archive/cellgeni/collection2
/archive/cellgeni/collection3

Where:

irodspath: iRODS collection path to retrieve metadata from

Upload Behavior

For Individual Files

Files are uploaded directly to the specified iRODS path:

Local: /path/to/file.txt
iRODS: /archive/cellgeni/target/file.txt

For Directories

All files within the directory are uploaded while preserving the directory structure:

Local: /path/to/directory/subdir/file.txt
iRODS: /archive/cellgeni/target/directory/subdir/file.txt

File Filtering

When --ignore_ext is specified, files containing the specified extensions are excluded from upload:

--ignore_ext ".bam,.fastq.gz,.tmp"

Metadata Value Splitting

When attaching metadata, values can contain multiple entries separated by a delimiter (default: ";"). Each entry will create a separate metadata attribute with the same key:

Input CSV:

irodspath,authors,keywords
/archive/collection1,"John Doe;Jane Smith","genomics;analysis"

Result: Creates four separate metadata entries:

authors = John Doe
authors = Jane Smith
keywords = genomics
keywords = analysis

Examples

Basic File Upload

Upload files and directories to iRODS:

nextflow run main.nf --upload upload.csv

File Upload with Filtering

Upload files while excluding specific file types:

nextflow run main.nf \
    --upload upload.csv \
    --ignore_ext ".bam,.fastq.gz,.tmp"

Metadata Attachment

Attach metadata to existing iRODS collections:

nextflow run main.nf --attach_metadata metadata.csv

Metadata Attachment with JSON

Attach metadata using JSON format:

nextflow run main.nf --attach_metadata metadata.json

Remove Existing Metadata

Remove existing metadata before adding new metadata:

nextflow run main.nf \
    --attach_metadata metadata.csv \
    --remove_existing_metadata

Metadata Retrieval

Retrieve metadata from existing iRODS collections:

nextflow run main.nf --get_metadata get_metadata.csv

Enable Verbose Output

Get detailed logging information during upload:

nextflow run main.nf \
    --upload upload.csv \
    --verbose

Custom Output Directory

Specify a different output directory for results:

nextflow run main.nf \
    --upload upload.csv \
    --output_dir "my_results"

Custom Metadata Separator

Use semicolons to split multiple values in metadata fields:

nextflow run main.nf \
    --attach_metadata metadata.csv \
    --dup_meta_separator ";"

Custom Metadata Index Column

Use a different column name for iRODS paths:

nextflow run main.nf \
    --attach_metadata metadata.csv \
    --metadata_index_name "irods_collection_path"

Expected Data Structure

For File Upload

The pipeline accepts any file or directory structure. Common use cases include:

Individual files:

/path/to/data.csv
/path/to/analysis.txt
/path/to/results.json

Directory structures:

/path/to/experiment/
├── sample1/
│   ├── data.txt
│   ├── results.csv
│   └── analysis/
│       └── output.json
├── sample2/
│   └── data.txt
└── metadata.tsv

For Metadata Attachment

Metadata is attached to existing iRODS collections. The collections should already exist in iRODS before running the metadata attachment operation.

For Metadata Retrieval

Metadata is retrieved from existing iRODS collections. The collections should already exist in iRODS before running the metadata retrieval operation.

Output Files

Upload Operation

MD5 checksums file: {output_dir}/md5sums.csv
- Contains MD5 checksums for all uploaded files
- Includes both local and iRODS checksums for verification
- Format: collection_id,filepath,irodspath,md5,irodsmd5

Metadata Attachment Operation

Metadata is directly attached to iRODS collections
No local output files are generated

Metadata Retrieval Operation

Metadata file: {output_dir}/metadata.csv
- Contains retrieved metadata from specified iRODS collections
- Aggregated metadata from all queried collections

iRODS Integration

Upload Process

Files are transferred to iRODS using the iput command
MD5 checksums are calculated for both local and iRODS copies
Checksums are compared to ensure data integrity
Upload results are logged and saved to CSV format

Metadata Attachment Process

Metadata key-value pairs are extracted from the input CSV or JSON file
If --remove_existing_metadata is enabled, existing metadata is removed first
Each metadata attribute is attached to the specified iRODS collection
Existing metadata can be updated or new metadata can be added

Metadata Retrieval Process

iRODS collection paths are read from the input CSV file
Metadata is retrieved from each specified iRODS collection using imeta commands
Retrieved metadata is aggregated and formatted into a consolidated CSV file
The final metadata file is saved to the output directory

System Requirements

Nextflow: Version 25.04.4 or higher
Singularity: For containerized execution
iRODS client: Access to iRODS commands (iput, imeta, etc.)
LSF: For job submission on HPC clusters (configured for Sanger's environment)
Python: Python 3.x with pandas for metadata aggregation operations

Testing

The pipeline includes comprehensive testing infrastructure:

nf-test: Testing framework for Nextflow modules and workflows
Test data: Example files located in tests/ directory
Module tests: Individual module testing in modules/*/tests/ directories
Example files: Sample input files in examples/ directory for each operation mode

To run tests:

nf-test test

Error Handling

File not found: Pipeline will fail if specified local files/directories don't exist
iRODS connection: Pipeline will retry failed iRODS operations up to 5 times, then ignore on final failure
Checksum mismatch: Upload failures are reported in the output logs
Invalid CSV format: Pipeline validates CSV headers and structure
Empty metadata: Modules handle empty metadata gracefully with appropriate warnings
Path resolution: Automatic detection of iRODS collections vs data objects, including symbolic links

Monitoring and Logging

The pipeline generates comprehensive reports in the reports/ directory:

Timeline report: Visual timeline of task execution
Execution report: Detailed resource usage and performance metrics
Trace file: Complete execution trace for debugging

All temporary work files are stored in nf-work/ directory and can be cleaned up after successful execution.

Usage Notes

Only one operation mode can be used per pipeline run (--upload OR --attach_metadata OR --get_metadata)
File paths must be absolute paths to avoid ambiguity
iRODS collections for metadata attachment and retrieval must exist before running the pipeline
Metadata files can be in either CSV or JSON format
When using --remove_existing_metadata, all existing metadata will be removed before adding new metadata
Large file uploads may take considerable time depending on network bandwidth
The pipeline is optimized for batch operations rather than single file transfers
Configuration files in configs/ directory allow fine-tuning of individual modules
The pipeline uses Singularity containers with specific images for Python-based operations
All modules include comprehensive metadata documentation in meta.yml files

Name		Name	Last commit message	Last commit date
Latest commit History 63 Commits
configs		configs
examples		examples
modules/sanger-cellgeni		modules/sanger-cellgeni
tests/data/csv/concat		tests/data/csv/concat
.nf-core.yml		.nf-core.yml
LICENSE		LICENSE
README.md		README.md
main.nf		main.nf
modules.json		modules.json
nextflow.config		nextflow.config
nf-test.config		nf-test.config

License

cellgeni/nf-upload2irods

Folders and files

Latest commit

History

Repository files navigation

nf-upload2irods

Overview

Contents of Repo:

Pipeline Workflow

Pipeline Parameters

Required Parameters (choose one):

Optional Parameters:

Input File Formats

Option 1: File Upload (--upload)

Option 2: Metadata Attachment (--attach_metadata)

Option 3: Metadata Retrieval (--get_metadata)

Upload Behavior

For Individual Files

For Directories

File Filtering

Metadata Value Splitting

Examples

Basic File Upload

File Upload with Filtering

Metadata Attachment

Metadata Attachment with JSON

Remove Existing Metadata

Metadata Retrieval

Enable Verbose Output

Custom Output Directory

Custom Metadata Separator

Custom Metadata Index Column

Expected Data Structure

For File Upload

For Metadata Attachment

For Metadata Retrieval

Output Files

Upload Operation

Metadata Attachment Operation

Metadata Retrieval Operation

iRODS Integration

Upload Process

Metadata Attachment Process

Metadata Retrieval Process

System Requirements

Testing

Error Handling

Monitoring and Logging

Usage Notes

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Languages

Option 1: File Upload (`--upload`)

Option 2: Metadata Attachment (`--attach_metadata`)

Option 3: Metadata Retrieval (`--get_metadata`)

Packages