Releases: litmuschaos/litmus
1.9.0-RC1
1.8.0
New Features & Enhancements
-
Introduces the alpha-0 version of Litmus Portal. The portal helps you to execute & visualize chaos workflows, amongst many other things. Learn more about it here
-
Extends Litmus Probes with “Continuous” mode to validate the hypothesis around application behavior throughout chaos execution, rather than only at specific points/phases (start & end of chaos)
-
Adds Node & Pod level I/O stress chaos experiments with the ability to tune worker threads and filesystem usage, to the generic experiment suite.
-
Supports network chaos on Containerd & CRI-O runtimes, in addition to Docker.
-
Supports network chaos between distinct microservices (in addition to total interface level egress traffic chaos) specified by their IPs or hostnames/service FQDNs
-
Enhances the ChaosSchedule schema for repeat mode by adding IncludedHours & IncludedDays. The StartTime/EndTime definitions have been made optional, so a schedule can run from the point of creation of the schedule CR or indefinitely until removal.
-
Migrates Cassandra ring disruption experiment to go-based chaoslib
-
Adds the ability to specify a target pod (env: TARGET_POD) or node (env: APP_NODE) as the application/resource under test, apart from randomized selections based on labels.
-
Enables the definition of blast radius for an application as a percentage value (PODS_AFFECTED_PERCENTAGE), by which an appropriate number of replicas undergo the specified chaos in parallel.
-
Improves the litmus chaoslib to take container fs & runtime socket file paths as tunables to support different Kubernetes platforms
-
Includes an additional pumba-based chaoslib for cpu/memory stress that uses external chaos containers (non-pod exec mode)
-
Adds chaos command tunables (for chaos injection & revert) for cpu/memory chaoslib (in pod exec mode) - in order to cover different base images & distros.
-
Supports broader filtering of pods within a namespace when no application labels are provided in .spec.appInfo. Users can also choose to skip the specification of application namespace explicitly, in which case the target pods are selected randomly from the ChaosEngine resource namespace.
-
Modifies the litmus chaos containers (operator, runner) to run with non-root users
-
Allows the definition of an INSTANCE_ID in the ChaosEngine to provide additional context or metadata to an experiment run. This also aids the creation of newer ChaosResult resources instead of patching/overwriting existing ones in case of repeated executions.
-
Improves the experiment code standards by fixing the issues listed in the GoGitOps report card for the litmus-go repository.
-
Generates events against the ChaosResult resource to indicate the experiment verdict (Pass, Fail, Stopped). These are useful in annotating monitoring dashboards with experiment results.
-
Enhances the Chaos Exporter to push chaos metrics to AWS CloudWatch
-
Improves the kubernetes-chaos helm chart by including options in the values.yaml to selectively install experiments via a whitelist/blacklist. Also maps the experiment names to reflect those on the ChaosHub.
-
Enhances litmus-e2e with increased reporting around component tests, the addition of e2e tests for new experiments, and a Docker-based GitLab runner for litmus-portal pipelines
-
Provides additional documentation based on experiment enhancements. Updates the get started documentation for general Kubernetes/OpenShift/Rancher platforms.
-
Enhances the litmus-demo scripts to generate a pdf report for the chaos experiments executed
-
Operationalizes the Litmus community Special Interest Groups (SIGs) for Documentation, Observability & Integrations.
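As an illustration of the blast radius tunable (PODS_AFFECTED_PERCENTAGE) listed above, the sketch below shows one way a percentage could be turned into a number of target replicas. The function name target_pod_count, the round-down integer division, and the minimum of one target are assumptions made for illustration, not the documented chaoslib behavior.

```shell
# Hypothetical helper: convert a blast radius percentage into a replica count.
# Assumptions (not taken from the release notes): integer division rounds
# down, and at least one pod is targeted whenever any replicas exist.
target_pod_count() {
  replicas=$1
  percentage=$2
  count=$(( replicas * percentage / 100 ))
  if [ "$count" -lt 1 ] && [ "$replicas" -gt 0 ]; then
    count=1
  fi
  echo "$count"
}

target_pod_count 10 50   # prints 5
target_pod_count 3 10    # prints 1 (minimum of one target)
```

Under these assumptions, a deployment with 10 replicas and a value of 50 would see 5 pods undergo chaos in parallel.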
Major Bug Fixes
-
Constructs ChaosResult name using experiment names passed from the ChaosExperiment resource instead of hardcoded experiment names
-
Fixes the chaos verification (whether chaos injection has occurred) steps in the container-kill experiment & retains the helper containers in case of errors for further debugging
-
Fixes the chaos event messages to be meaningful & include probe information only when the probes are defined
-
Removes the need for privileged containers to execute disk-fill chaos experiment
-
Handles the case where cpu/memory hog chaos processes are terminated or the target containers are OOM-Killed (this typically occurs when the memory hog/injection value exceeds resource limits set against the pods/containers). The error code 137 is handled appropriately with warning logs and the experiment proceeds with verification steps instead of erroring out/failing (the OOM-Kill is an expected behavior based on inputs provided)
-
Fixes the behavior in node-memory hog experiments where the provided input (percentage of node memory) is measured against the available memory instead of the total system memory
-
Propagates the custom chaos experiment annotations provided in the ChaosExperiment to the helper pods, if any. This is especially useful in cases where annotations decide scheduling or are mapped to certain IAM roles/accounts, etc.
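The error code 137 mentioned in the cpu/memory hog fix above follows the common shell convention of 128 plus the signal number, where SIGKILL (the signal an OOM-Killed process receives) is signal 9. A minimal sketch of that handling follows; the function name and the ok/warn/fail verdict strings are hypothetical and are not taken from the litmus code.

```shell
# Hypothetical helper: decide how to treat the exit code of a chaos command.
# 137 = 128 + 9 (SIGKILL), which is what an OOM-Killed process reports.
handle_chaos_exit() {
  code=$1
  sigkill_code=$(( 128 + 9 ))
  if [ "$code" -eq 0 ]; then
    echo "ok"
  elif [ "$code" -eq "$sigkill_code" ]; then
    # Expected when the hog value exceeds the container's resource limits:
    # log a warning and continue with the verification steps.
    echo "warn"
  else
    echo "fail"
  fi
}

handle_chaos_exit 137   # prints warn
```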
Deprecations & Breaking Changes
- The instance count (.spec.schedule.instanceCount) property on the chaosSchedule has been deprecated in favor of maintaining just the minChaosInterval as a means of defining chaos cadence.
Major Known Issues & Limitations
Issue
- The network chaos experiments (especially on docker runtime, using the litmus pumba lib) can end up with a Failed ChaosResult and the app stuck in CrashLoopBackOff state when application deployments are configured with liveness probes (that are set up to access health/service endpoints). Typically, this lib injects the tc netem rule against the interface by running a “chaos container” that attaches to the network namespace of the target container via the target’s container ID. The same ID is used in a subsequent container launched to revert the rule/chaos. However, with liveness probes, the container is restarted several times during the course of the chaos duration, causing the ID to change. The revert fails, with the network rule still persisting (courtesy of the Kubernetes pause container for this app pod), leading to the app entering a CrashLoopBackOff state.
Current Workaround
- Delete/reschedule the target pod manually to recreate the pause container/network namespace.
- Use Target IPs or Hosts to inject the chaos between specific microservices while keeping the probe alive.
Note: This is expected to be fixed in a 1.8.x patch release
Issue
- The kubelet-service-kill experiment makes use of systemctl to stop/start the service today. Running this experiment without an external LIB_IMAGE and leveraging the experiment image can throw the error "Failed to connect to bus: No data available", as the experiment runs with a non-root user.
Current Workaround
- A standard Ubuntu image that runs as root can be used in a “helper” pod that injects this chaos. However, user discretion is advised in terms of providing this access.
Issue
- The pod-cpu-hog & pod-memory-hog experiments that run in the pod-exec mode (which is typically used when users don’t want to mount the runtime’s socket files on their pods) using the default lib can fail even though chaos was injected successfully. This happens when the target’s image lacks certain default utils that are used to detect the chaos processes and kill them (revert chaos) at the end of the chaos duration.
Workaround
-
Users can identify the necessary commands to derive and kill the chaos PIDs and pass them to the experiment via the env variable CHAOS_KILL_COMMAND.
-
Alternatively, they can make use of the chaos lib that uses external containers with the SYS_ADMIN docker capability to inject/revert the chaos, while mounting the runtime socket file. Note that this is supported only on docker at this point.
Note: This is expected to be fixed in a 1.8.x patch release
Installation
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.8.0.yaml
Verify your installation
-
Verify if the chaos operator is running
kubectl get pods -n litmus
-
Verify if chaos CRDs are installed
kubectl get crds | grep chaos
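The CRD check above can be scripted if it needs to run unattended. The sketch below is a hypothetical wrapper (the function name count_chaos_crds is an assumption) that counts matching lines instead of printing them, so that a caller can compare the result with the number of CRDs it expects.

```shell
# Hypothetical helper: count CRD names containing "chaos" from
# `kubectl get crds` output supplied on stdin.
count_chaos_crds() {
  # grep -c prints the number of matching lines; "|| true" keeps the exit
  # status at zero when nothing matches (grep itself would return 1).
  grep -c chaos || true
}

# Usage against a live cluster (requires kubectl access):
#   kubectl get crds | count_chaos_crds
```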
For more details refer to the documentation at Docs
1.8.0-RC2
1.8.0-RC1
chore: (litmus-portal) Refactoring and bug fixes (#2027)
This commit has the following changes:
- folder structure change for models and useEffect fixes
- user redux fixed
- graphql documents re-organised
Signed-off-by: arkajyotiMukherjee <[email protected]>
1.7.0
New Features & Enhancements
-
Introduces experiment probes to enable declarative specification of entry/exit (success) criteria via the chaosengine. This release supports the Command, Kubernetes & HTTP probe types that can be configured in SoT (Start of Test), EoT (End of Test) & Edge execution modes. With this, users can reuse generic experiments to test a variety of app-specific/context-specific chaos scenarios.
-
Enhances the chaosresult status schema to include the ProbeSuccessPercentage score that gives an overview of the app/infra resilience to a specific chaos experiment run
-
Refines operational modes of litmus: Introduces namespaced operator support in helm charts to support multi-developer/shared cluster use-case with dedicated namespaces, such as in the Okteto Cloud, while updating the admin & standard mode functionality to watch engine resources in litmus & across namespaces respectively
-
Adds functionality to look for target applications in the chaosengine resource namespace if the target namespace is not explicitly specified.
-
Validates/prevents malformed application labels in the chaosengine
-
Improves the ChaosEngine status schema to hold more info (experiment pod names, runner names) that can aid other tools/abstractions running the experiment to derive/parse useful info for further reuse (logs extraction, for ex.)
-
Adds Microsoft Azure Kubernetes Service (AKS) as a supported platform for the generic experiment suite.
-
Adds a new chaos experiment to scale pods/test node autoscale functionality
-
Adds the libraries for the execution of AWS chaos using chaostoolkit, orchestrated by Litmus.
-
Adds support for the specification of host file mounts in chaos experiments
-
Allows setting polling intervals and timeouts for status checks via chaosengine to aid tuning execution for slower environments
-
Removes dependencies on multiple experiment “helper” (auxiliary) images and makes the litmus go-runner self-sufficient in handling the required chaos business logic. This eases maintenance, especially in the case of air-gapped environments / downstream projects that build the litmus components in their respective CI/CD pipelines.
-
Enhances the experiment to “fail fast” upon failed app checks in cases where containers are terminated
-
Upgrades the ansible-runner to use python3
-
Enhances the developer experience for litmus chaos experiments by using Okteto CLI to develop & test experiment business logic in-cluster over repeating image-build-job-run cycles
-
Updates the scaffold utils to generate the experiment bootstrap code based on the latest developments in the experiment structure.
-
Adds chaos-instrumented grafana dashboards for the sock-shop application along with details on setting up monitoring for chaos experiment runs.
-
Adds pre-defined/usable workflows for repeatable execution of node resource chaos in the chaos-charts repo
-
Pushes the technical preview / pre-alpha version of the litmus-portal (available on the master branch).
-
Refactors the litmus-e2e repo/code-structure to simplify the addition of new BDD tests (modularization, removal of bash utils, formatted errors, klog usage, scenario coverage parameters)
-
Adds additional stages in litmus-e2e GitLab pipelines to execute both the go-based & ansible-based chaos experiments
-
Improves github-actions based comment-triggered e2e runs with log details
-
Features a completely revamped & improved ChaosHub
-
Improves the project wiki with more information for users and developers (architecture docs, video tutorials, charters for the Litmus Special Interest Groups)
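To make the ProbeSuccessPercentage score listed above more concrete, the sketch below assumes it is the share of probes that passed out of all probes evaluated. That formula, the function name, and the Passed/Failed verdict strings are illustrative assumptions rather than the documented chaosresult schema.

```shell
# Hypothetical helper: percentage of probe verdicts equal to "Passed".
# Usage: probe_success_percentage Passed Failed Passed ...
probe_success_percentage() {
  total=$#
  passed=0
  for verdict in "$@"; do
    if [ "$verdict" = "Passed" ]; then
      passed=$(( passed + 1 ))
    fi
  done
  if [ "$total" -eq 0 ]; then
    # No probes supplied; returning 0 here is an arbitrary choice for the sketch.
    echo 0
    return
  fi
  echo $(( passed * 100 / total ))
}

probe_success_percentage Passed Failed Passed Passed   # prints 75
```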
Major Bug Fixes
-
Patches the chaosengine with the right status (‘stopped’) and fixes the event to provide the right reason in cases where app filtering is unsuccessful. This will allow a re-apply of the engine to re-trigger the application.
-
Adds a check to factor-in cordoned (SchedulingDisabled) status of nodes in kubelet & docker-service kill experiments.
-
Provides the tc_image used in network chaos experiments as an experiment tunable over hardcoding in order to support users with internal image registries
-
Decides experiment termination based on chaos container status over that of chaos pod objects to support operations in a service-mesh environment (istio, linkerd) where all pods (including chaos resources) are injected with sidecars. Without this, the experiment runs forever due to the proxy sidecars.
-
Sets the restart policy of the experiments jobs to Never over OnFailure to prevent repeated re-execution for certain experiment failure conditions.
-
Fixes the incorrect eventType for chaos events in cases of failures & skipped executions.
-
Fixes the go-based pod-cpu-hog & pod-memory-hog experiments to execute the chaos processes (commands) in the target container by passing them as args to a shell instance (/bin/sh -c), to account for targets which may run with different entrypoints.
-
Fixes permission issues on the infra helm chart resulting in failed metrics collection
Breaking Changes
-
Stops support for the ansible-runner/executor (EoL) (Not to be confused with the ansible-based chaos experiments)
-
Removes the following repositories:
-
litmuschaos/pages: The operator manifests are available over gh-pages sourced out of litmuschaos/litmus
-
litmuschaos/chaos-helm: The experiments helm chart has also moved into the litmus-helm repo.
-
litmuschaos/community: The demo procedures & community info are now available within the litmus-demo & the litmus repo respectively.
Installation
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.7.0.yaml
Verify your installation
-
Verify if the chaos operator is running
kubectl get pods -n litmus
-
Verify if chaos CRDs are installed
kubectl get crds | grep chaos
For more details refer to the documentation at Docs
1.6.0
New Features and Enhancements
- Specification of pod and container security context for the experiment resources via chaosexperiment spec
- Introduces pod scheduling policy support via NodeSelector specification on the chaosengine (instance-specific attribute)
- Ability to override experiment images from the chaosengine
- Pushes an experiment execution summary event on the chaosresult resource
- Adds the network chaos experiment to induce packet duplication
- Adds node chaos experiment to force pod evictions via taints
- Adds service chaos experiment to kill docker service on the node
- Extends the golang chaoslib support for all existing chaos experiments in the generic suite
- Validation webhook enhancements to verify if application labels specified in the chaosengine are propagated to pod templates of the applications under test (AUT)
- Additional examples to illustrate litmus chaos workflows, such as benchmarking nginx with the Apache Benchmark tool during parallel pod-kills
- Migrates the ansible-based chaos experiments to the litmus-ansible repo from litmuschaos/litmus in line with the litmus-go, litmus-python repo structure
- Improves the unit-test based coverage for chaos operator by 30%
- Extends the capability to trigger on-demand e2e runs for PRs via GitHub comments to the chaos operator
- Adds framework to determine e2e coverage percentage based on comparison of executed tests in the pipeline versus test plan
- Introduces an e2e portal to view e2e pipeline data and coverage
- Improves the Travis-based CI pipeline of the test-tools repo to build images only if the respective Dockerfile or scripts are modified, instead of building all images irrespective of the nature of the commit.
- Increases sources for (helm-based) litmus installation to include helm hub & jfrog chartcenter artifact repositories
- Adds betterci integration to charthub to obtain UI/UX previews for PRs
- Enhances individual experiment documentation with abort procedure & troubleshooting references
- Enhances the experiment failure and uninstall troubleshooting sections to include more conditions
- Includes steps to run chaos experiments on rancher platform
- Includes missing video links/examples for chaos experiments in the generic suite
- Updates all the litmuschaos websites (docs, charthub, project website) based on CNCF guidelines
- Enhances the release guidelines doc with an enhanced release checklist
Major Bug Fixes
- Fixes invalid Jinja template for chaos injection (helper) pod in the kubelet-service-kill experiment
- Specifies an upper limit for the memory hog experiment docs based on the current resource exhaustion approach via dd
- Adds instructions in infra (node) chaos experiments to cordon the AUT before the execution of chaos to prevent the restart of litmus pods
- Fixes a race condition in the pod-delete experiment where the verdict is flagged as fail despite successful execution
- Fixes Kafka experiment failure while trying to derive leader broker for the test topic (partition) due to missing ns and improper regex
- Fixes coredns experiment regression (caused due to introduction of helper pods logic for the pod-delete experiment) due to missing lib_image in experiment CR
Installation
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.6.0.yaml
Verify your installation
-
Verify if the chaos operator is running
kubectl get pods -n litmus
-
Verify if chaos CRDs are installed
kubectl get crds | grep chaos
For more details refer to the documentation at Docs
1.5.1
1.5.0
New Features and Enhancements
-
Features a revamped chaos charthub with a more resilient design and improved user experience
-
Introduces ability (github workflows) to trigger individual/multiple e2e tests or complete e2e test-suite for litmus PRs via GitHub comments
-
Adds a new repo litmuschaos/litmus-demo to provide a fully packaged demo environment to run chaos under 10 min
-
Adds node service kill chaos libraries (& kubelet kill chaos experiment on specified nodes)
-
Improves the pod cpu hog experiment by adding go chaoslib to support containerd/crio runtime
-
Introduces chaoslib pattern to choose blast radius / percentage (target) pods and abort chaos on target containers
-
Improves the chaos-scheduler controller to halt/resume chaos
-
Enhances the chaos-schedule CR schema to provide dedicated attributes for the schedule modes (now, once, repeat) over mutually-exclusive fields with enhanced OpenAPI schema validation
-
Introduces ImagePullPolicy as a chaosexperiment CR attribute (.spec.definition.imagePullPolicy) to support use cases where the experiments need to run with locally built images, as with PR-triggered e2e
-
Enhances the container-kill experiment to repeat the chaos per an interval over a total duration with support for containerd/crio runtime.
-
Adds go-based helper pods for pod-delete and container-kill chaos libraries
-
Improves the litmus-go scaffold tool to use lighter base images & improved default events
-
Improves the validating webhook-based admission controller to call out missed annotations on target applications
-
Improves unit-test coverage for chaos-operator
-
Enhances the getting started (chaosengine construction) & troubleshooting docs (uninstallation steps)
Major Bug Fixes
-
Fixes the missing/clustered event generation on litmus-go chaos experiment
-
Fixes operator behavior of triggering chaos disregarding annotation status on the target application
-
Fixes the cluster level running experiment count metric from chaos-exporter
-
Adds concurrent updating of the event counter for each iteration of chaos injection
-
Fixes chaos experiment failures (securitycontext additions) on OpenShift 4.3
Installation
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.5.0.yaml
Verify your installation
-
Verify if the chaos operator is running
kubectl get pods -n litmus
-
Verify if chaos CRDs are installed
kubectl get crds | grep chaos
For more details refer to the documentation at Docs
1.4.1
[Cherry-Pick for 1.4.1] (#1535)
* (chore)roadmap: update roadmap status (#1530) Signed-off-by: ksatchit <[email protected]>
* update(helper-pod): Wait till the helper pod come into running state (#1533) Signed-off-by: shubhamchaudhary <[email protected]>
Co-authored-by: Shubham Chaudhary <[email protected]>
1.4.1-RC1
fix(pod-delete): Fixing pod-delete chaolib (#1526) (#1528) Signed-off-by: Udit Gaurav <[email protected]> Co-authored-by: UDIT GAURAV <[email protected]>