
Commit a2ffbf6

Update the comparison to other tools. (#736)
1 parent 69daffc commit a2ffbf6

File tree

2 files changed: +75 / -86 lines


CHANGELOG.md

Lines changed: 2 additions & 0 deletions

```diff
@@ -11,6 +11,8 @@ releases are available on [PyPI](https://pypi.org/project/pytask) and
   default pickle protocol.
 - {pull}`???` adapts the interactive debugger integration to Python 3.14's
   updated `pdb` behaviour and keeps pytest-style capturing intact.
+- {pull}`???` updates the comparison to other tools documentation and adds a section on
+  the Common Workflow Language (CWL) and WorkflowHub.
 
 ## 0.5.7 - 2025-11-22
```

docs/source/explanations/comparison_to_other_tools.md

Lines changed: 73 additions & 86 deletions

```diff
@@ -10,124 +10,111 @@ in other WMFs.
 
 ## [snakemake](https://github.com/snakemake/snakemake)
 
-Pros
-
-- Very mature library and probably the most adapted library in the realm of scientific
-  workflow software.
-- Can scale to clusters and use Docker images.
-- Supports Python and R.
-- Automatic test case generation.
-
-Cons
-
-- Need to learn snakemake's syntax which is a mixture of Make and Python.
-- No debug mode.
-- Seems to have no plugin system.
+Snakemake is one of the most widely adopted workflow systems in scientific computing. It
+scales from local execution to clusters and cloud environments, with built-in support
+for containers and conda environments. Workflows are defined using a DSL that combines
+Make-style rules with Python, and can be exported to CWL for portability.
```
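To make the Make-plus-Python flavour of the DSL concrete, a Snakemake rule might look like the following sketch (the rule name and file paths are invented for illustration):

```
rule word_count:
    input:
        "data/corpus.txt"
    output:
        "results/counts.txt"
    shell:
        "wc -w {input} > {output}"
```

Each rule declares its inputs and outputs declaratively, Make-style, while arbitrary Python can be mixed in around the rules.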

```diff
 ## [ploomber](https://github.com/ploomber/ploomber)
 
-General
-
-- Strong focus on machine learning pipelines, training, and deployment.
-- Integration with tools such as MLflow, Docker, AWS Batch.
-- Tasks can be defined in yaml, python files, Jupyter notebooks or SQL.
-
-Pros
-
-- Conversion from Jupyter notebooks to tasks via
-  [soorgeon](https://github.com/ploomber/soorgeon).
-
-Cons
-
-- Programming in Jupyter notebooks increases the risk of coding errors (e.g.
-  side-effects).
-- Supports parametrizations in form of cartesian products in `yaml` files, but not more
-  powerful parametrizations.
+Ploomber focuses on machine learning pipelines with strong integration into MLflow,
+Docker, and AWS Batch. Tasks can be defined in YAML, Python files, Jupyter notebooks, or
+SQL, and it can convert notebooks into pipeline tasks.
 
 ## [Waf](https://waf.io)
 
-Pros
-
-- Mature library.
-- Can be extended.
-
-Cons
-
-- Focus on compiling binaries, not research projects.
-- Bus factor of 1.
+Waf is a mature build system primarily designed for compiling software projects. It
+handles complex build dependencies and can be extended with Python.
 
 ## [nextflow](https://github.com/nextflow-io/nextflow)
 
-- Tasks are scripted using Groovy which is a superset of Java.
-- Supports AWS, Google, Azure.
-- Supports Docker, Shifter, Podman, etc.
+Nextflow is a workflow system popular in bioinformatics that runs on AWS, Google Cloud,
+and Azure. It uses Groovy (a JVM language) for scripting and has strong support for
+containers including Docker, Singularity, and Podman.
 
 ## [Kedro](https://github.com/kedro-org/kedro)
 
-Pros
-
-- Mature library, used by some institutions and companies. Created inside McKinsey.
-- Provides the full package: templates, pipelines, deployment
+Kedro is a mature workflow framework developed at McKinsey that provides project
+templates, data catalogs, and deployment tooling. It is designed for production machine
+learning pipelines with a focus on software engineering best practices.
 
 ## [pydoit](https://github.com/pydoit/doit)
 
-General
-
-- A general task runner which focuses on command line tools.
-- You can think of it as an replacement for make.
-- Powers Nikola, a static site generator.
+pydoit is a general-purpose task runner that serves as a Python replacement for Make. It
+focuses on executing command-line tools and powers projects like Nikola, a static site
+generator.
 
 ## [Luigi](https://github.com/spotify/luigi)
 
-General
-
-- A build system written by Spotify.
-- Designed for any kind of long-running batch processes.
-- Integrates with many other tools like databases, Hadoop, Spark, etc..
-
-Cons
-
-- Very complex interface and a lot of stuff you probably don't need.
-- [Development](https://github.com/spotify/luigi/graphs/contributors) seems to stall.
+Luigi is a workflow system built by Spotify for long-running batch processes. It
+integrates with Hadoop, Spark, and various databases for large-scale data pipelines.
+Development has slowed in recent years.
 
 ## [sciluigi](https://github.com/pharmbio/sciluigi)
 
-sciluigi aims to be a lightweight wrapper around luigi.
-
-Cons
-
-- [Development](https://github.com/pharmbio/sciluigi/graphs/contributors) has basically
-  stalled since 2018.
-- Not very popular compared to its lifetime.
+sciluigi is a lightweight wrapper around Luigi aimed at simplifying scientific workflow
+development. It reduces some of Luigi's boilerplate for research use cases. Development
+has stalled since 2018.
 
 ## [scipipe](https://github.com/scipipe/scipipe)
 
-Cons
+SciPipe is a workflow library written in Go for building robust, flexible pipelines
+using Flow-Based Programming principles. It compiles workflows to fast binaries and is
+designed for bioinformatics and cheminformatics applications involving command-line
+tools.
 
-- [Development](https://github.com/scipipe/scipipe/graphs/contributors) slowed down.
-- Written in Go.
+## [SCons](https://github.com/SCons/scons)
 
-## [Scons](https://github.com/SCons/scons)
-
-Pros
-
-- Mature library.
-
-Cons
-
-- Seems to have no plugin system.
+SCons is a mature, cross-platform software construction tool that serves as an improved
+substitute for Make. It uses Python scripts for configuration and has built-in support
+for C, C++, Java, Fortran, and automatic dependency analysis.
 
 ## [pypyr](https://github.com/pypyr/pypyr)
 
-General
+pypyr is a task-runner for automation pipelines defined in YAML. It provides built-in
+steps for common operations like loops, conditionals, retries, and error handling
+without requiring custom code, and is often used for CI/CD and DevOps automation.
+
+## [ZenML](https://github.com/zenml-io/zenml)
 
-- A general task-runner with task defined in yaml files.
+ZenML is an MLOps framework for building portable ML pipelines that can run on various
+orchestrators including Kubernetes, AWS SageMaker, GCP Vertex AI, Kubeflow, and Airflow.
+It focuses on productionizing ML workflows with features like automatic
+containerization, artifact tracking, and native caching.
 
-## [zenml](https://github.com/zenml-io/zenml)
+## [Flyte](https://github.com/flyteorg/flyte)
 
-## [flyte](https://github.com/flyteorg/flyte)
+Flyte is a Kubernetes-native workflow orchestration platform for building
+production-grade data and ML pipelines. It provides automatic retries, checkpointing,
+failure recovery, and scales dynamically across cloud providers including AWS, GCP, and
+Azure.
 
 ## [pipefunc](https://github.com/pipefunc/pipefunc)
 
-A tool for executing graphs made out of functions. More focused on computational
-compared to workflow graphs.
+pipefunc is a lightweight library for creating function pipelines as directed acyclic
+graphs (DAGs) in pure Python. It automatically handles execution order, supports
+map-reduce operations, parallel execution, and provides resource profiling.
+
+## [Common Workflow Language (CWL)](https://www.commonwl.org/)
+
+CWL is an open standard for describing data analysis workflows in a portable,
+language-agnostic format. Its primary goal is to enable workflows to be written once and
+executed across different computing environments—from local workstations to clusters,
+cloud, and HPC systems—without modification. Workflows described in CWL can be
+registered on [WorkflowHub](https://workflowhub.eu/) for sharing and discovery following
+FAIR (Findable, Accessible, Interoperable, Reusable) principles.
+
+CWL is particularly prevalent in bioinformatics and life sciences where reproducibility
+across institutions is critical. Tools that support CWL include
+[cwltool](https://github.com/common-workflow-language/cwltool) (the reference
+implementation), [Toil](https://github.com/DataBiosphere/toil),
+[Arvados](https://arvados.org/), and [REANA](https://reanahub.io/). Some workflow
+systems like Snakemake and Nextflow can export workflows to CWL format.
+
+pytask is not a CWL-compliant tool because it operates on a fundamentally different
+model. CWL describes workflows as graphs of command-line tool invocations where data
+flows between tools via files. pytask, in contrast, orchestrates Python functions that
+can execute arbitrary code, manipulate data in memory, call APIs, or perform any
+operation available in Python. This Python-native approach enables features like
+interactive debugging but means pytask workflows cannot be represented in CWL's
+command-line-centric specification.
```
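For context on what the command-line-centric model looks like, a CWL description is a declarative YAML document. A minimal `CommandLineTool`, in the spirit of the CWL user guide's hello-world example, might look like this sketch:

```yaml
# Wrap the `echo` program as a CWL tool: one positional string input, no outputs.
cwlVersion: v1.2
class: CommandLineTool
baseCommand: echo
inputs:
  message:
    type: string
    inputBinding:
      position: 1
outputs: []
```

Every step is an invocation of a command-line program with declared inputs and outputs, which is what makes the format portable across engines but too restrictive for tasks that hold state in Python memory.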

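By contrast, the Python-native task model can be sketched in plain Python. The function name and file name below are invented for illustration, and pytask's real decorators and `Product` annotations are omitted so the snippet runs without pytask installed:

```python
from pathlib import Path
import tempfile

# pytask collects functions whose names start with ``task_`` and re-runs a
# task only when its dependencies or products change. This plain-Python
# sketch mimics that shape without importing pytask.
def task_summarize(produces: Path) -> None:
    # Arbitrary Python is allowed: compute in memory, then persist the product.
    numbers = [1, 2, 3, 4]
    produces.write_text(f"sum={sum(numbers)}")

out = Path(tempfile.mkdtemp()) / "summary.txt"
task_summarize(produces=out)
result = out.read_text()  # "sum=10"
```

Because the task body is ordinary Python rather than a wrapped command-line invocation, it can be stepped through with a debugger, which is the trade-off against CWL-style portability described above.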