feat(scale-agent): doc scale agent horizontal scaling #2250

Open
wants to merge 1 commit into
base: master
123 changes: 123 additions & 0 deletions content/en/plugins/scale-agent/concepts/horizontal-scaling.md
@@ -0,0 +1,123 @@
---
title: Horizontal Scaling Architecture and Features
linkTitle: Horizontal Scaling
description: >
  Learn how Horizontal Scaling distributes operations across Armory Scale Agent replicas in your Armory Continuous Deployment or Spinnaker environment.
aliases:
- /scale-agent/tasks/horizontal-scaling/
---

## Overview of Horizontal Scaling

Rather than sending each operation to the first Scale Agent instance that can handle it, Horizontal Scaling distributes operations across all the Scale Agent replicas that can handle them.

### How to enable and use Horizontal Scaling

First, familiarize yourself with the architecture and features in this guide. Then you can:

1. {{< linkWithTitle "plugins/scale-agent/tasks/horizontal-scaling/operations-enable.md" >}}

## Horizontal Scaling glossary

- **K8s Operation**: an abstraction of a Kubernetes operation, such as Get, List, Add, Delete, or Patch.
- **Dynamic account Operation**: an abstraction of a dynamic account operation, that is, adding or unregistering accounts.
- **Endpoint**: the URL segment after the Clouddriver root.
- **Request**: an instruction that isn’t fulfilled immediately and can have different outcomes. A request can be made through HTTP by the admin or internally by one of the services.

## Architecture

First, it is important to understand the main differences between K8s operations and dynamic account operations.

|K8s operations |Dynamic account operations |
|--------------------------------------------------------------------------------------------------|-------------------------------------------------------------------|
|Are handled by a single Scale Agent instance |Can be handled by more than one Scale Agent instance |
|Are processed on every polling cycle, configured by the `kubesvc.operations.database.scan` properties |Are processed on demand |
|Are assigned in the `clouddriver.kubesvc_operation_single_assign` table |Are assigned in the `clouddriver.kubesvc_operation_multiple_assign` table |


The Scale Agent stores K8s and dynamic account operation data in dedicated tables that act like a queue:
- `clouddriver.kubesvc_operation`: holds newly received operations
- `clouddriver.kubesvc_operation_single_assign`: holds K8s operations, each of which can be assigned to only one Scale Agent instance
- `clouddriver.kubesvc_operation_multiple_assign`: holds dynamic account operations, which can be assigned to multiple Scale Agent instances
- `clouddriver.kubesvc_operation_history`: holds the responses to K8s and dynamic account operations
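
The difference between the two assignment tables can be sketched with a uniqueness constraint. The following is a minimal illustration using SQLite, not the plugin's actual schema: the column names, the primary keys, and the `try_assign` helper are all assumptions made for the example.

```python
import sqlite3

# Hypothetical, simplified schemas; the real plugin tables are not shown in this guide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE single_assign (
        operation_id TEXT PRIMARY KEY,           -- at most one row per operation
        instance_id  TEXT NOT NULL
    );
    CREATE TABLE multiple_assign (
        operation_id TEXT NOT NULL,
        instance_id  TEXT NOT NULL,
        PRIMARY KEY (operation_id, instance_id)  -- one row per (operation, instance)
    );
""")


def try_assign(table: str, operation_id: str, instance_id: str) -> bool:
    """Insert an assignment row; return False if the table's key rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                # The table name is interpolated only because it is a fixed literal here.
                f"INSERT INTO {table} (operation_id, instance_id) VALUES (?, ?)",
                (operation_id, instance_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


# A K8s operation can be claimed by only one instance...
single = [try_assign("single_assign", "op-1", i) for i in ("agent-0", "agent-1")]
# ...while a dynamic account operation can be assigned to several.
multiple = [try_assign("multiple_assign", "op-2", i) for i in ("agent-0", "agent-1")]
```

In this sketch the second claim on `op-1` is rejected, while both assignments of `op-2` succeed.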

### K8s Operations

The Scale Agent Plugin creates one job per Scale Agent instance registration. Each job is in charge of:
1. Fetching pending K8s operations from the `clouddriver.kubesvc_operation` table
2. Assigning pending K8s operations in the `clouddriver.kubesvc_operation_single_assign` table
3. Fetching assigned K8s operations from the `clouddriver.kubesvc_operation_single_assign` table and sending them to the Scale Agent
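
The three steps of a job can be sketched as a single polling cycle. This is an in-memory illustration only: `OperationQueue` and `run_cycle` are hypothetical names, the tables are replaced by a list and a dict, and the sketch ignores which accounts each instance can actually serve.

```python
from dataclasses import dataclass, field


@dataclass
class OperationQueue:
    """In-memory stand-in for the tables used by K8s operations."""
    pending: list[str] = field(default_factory=list)             # kubesvc_operation
    single_assign: dict[str, str] = field(default_factory=dict)  # operation -> instance


def run_cycle(queue: OperationQueue, instance_id: str, batch_size: int = 5) -> list[str]:
    """One polling cycle of the job created for `instance_id`."""
    # 1. Fetch pending operations that no instance has claimed yet.
    unassigned = [op for op in queue.pending if op not in queue.single_assign]
    # 2. Assign at most `batch_size` of them to this instance.
    for op in unassigned[:batch_size]:
        queue.single_assign[op] = instance_id
    # 3. Fetch the operations assigned to this instance; a real job would send them on.
    return [op for op, owner in queue.single_assign.items() if owner == instance_id]


queue = OperationQueue(pending=[f"op-{n}" for n in range(8)])
sent_by_agent_0 = run_cycle(queue, "agent-0")  # claims op-0 .. op-4
sent_by_agent_1 = run_cycle(queue, "agent-1")  # claims the remaining op-5 .. op-7
```

Because each job claims at most `batch_size` operations per cycle, the eight pending operations end up split between the two instances instead of all going to the first one.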

Note that when an operation returns a bad response and there is still time to retry (based on the `kubesvc.cache.operationWaitMs` property), the Scale Agent Plugin:
1. Stores the response in the `clouddriver.kubesvc_operation_history` table
2. Unassigns the operation from the `clouddriver.kubesvc_operation_single_assign` table, so that the same or another Scale Agent instance can pick it up again
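
The retry handling can be sketched as follows. The function name, its parameters, and the way elapsed time is measured are assumptions for illustration; what the plugin does once the wait window has expired is not described here, so the sketch simply does nothing in that case.

```python
def handle_bad_response(operation_id, response, created_at, now,
                        history, single_assign, operation_wait_ms=30000):
    """Record a bad response and release the operation while a retry is still possible.

    `created_at` and `now` are timestamps in seconds. `operation_wait_ms` mirrors
    kubesvc.cache.operationWaitMs (default 30000).
    """
    elapsed_ms = (now - created_at) * 1000
    if elapsed_ms >= operation_wait_ms:
        return False  # assumption: no time left, so the sketch leaves everything as-is
    history.append((operation_id, response))  # 1. store the response in the history
    single_assign.pop(operation_id, None)     # 2. unassign so any instance can retry
    return True


history = []
single_assign = {"op-1": "agent-0", "op-2": "agent-1"}

# 10 seconds in: still inside the 30-second window, so op-1 is released for a retry.
retried = handle_bad_response("op-1", "error", 0.0, 10.0, history, single_assign)
# 31 seconds in: the window has passed, so op-2 stays assigned in this sketch.
expired = handle_bad_response("op-2", "error", 0.0, 31.0, history, single_assign)
```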

```mermaid
C4Deployment
title Scale Agent Horizontal Scaling Registration Jobs
Boundary(spin, "Armory Continuous Deployment or Spinnaker", "Instance", $borderColor="#0FC2C0") {
Boundary(cd, "Clouddriver", "Service", $borderColor="orange") {
System(sap, "Scale Agent Plugin<br/>", "For each registration creates a job to assign and send<br/>every N milliseconds the maximum number of K8s operations.<br/><br/>N = kubesvc.operations.database.scan.initialDelay | maxDelay<br/>maximum number = kubesvc.operations.database.scan.batchSize")
System(saj0, "Scale Agent Job 0", "")
System(saj1, "Scale Agent Job 1", "")
System(saj2, "Scale Agent Job 2", "")
UpdateElementStyle(saj0, $bgColor="#04AA6D", $borderColor="none")
UpdateElementStyle(saj1, $bgColor="#f44336", $borderColor="none")
UpdateElementStyle(saj2, $bgColor="#555555", $borderColor="none")
}
Boundary(sa, "Armory Scale Agent", "Service", $borderColor="purple") {
System(sar0, "Replica 0", "")
System(sar1, "Replica 1", "")
System(sar2, "Replica 2", "")
UpdateElementStyle(sar0, $bgColor="#04AA6D", $borderColor="none")
UpdateElementStyle(sar1, $bgColor="#f44336", $borderColor="none")
UpdateElementStyle(sar2, $bgColor="#555555", $borderColor="none")
}
Rel(sar0, sap, "Registration", "")
UpdateRelStyle(sar0, sap, $textColor="black", $lineColor="#04AA6D")
Rel(sar1, sap, "Registration", "")
UpdateRelStyle(sar1, sap, $textColor="black", $lineColor="#f44336")
Rel(sar2, sap, "Registration", "")
UpdateRelStyle(sar2, sap, $textColor="black", $lineColor="#555555")
Rel(sap, saj0, "Create")
UpdateRelStyle(sap, saj0, $textColor="black", $lineColor="#04AA6D")
Rel(sap, saj1, "Create")
UpdateRelStyle(sap, saj1, $textColor="black", $lineColor="#f44336", $offsetX="-30", $offsetY="55")
Rel(sap, saj2, "Create")
UpdateRelStyle(sap, saj2, $textColor="black", $lineColor="#555555", $offsetX="-60", $offsetY="155")
BiRel(sar0, saj0, "HandleOp", "request/response")
UpdateRelStyle(sar0, saj0, $textColor="black", $lineColor="#04AA6D", $offsetX="-100", $offsetY="30")
BiRel(sar1, saj1, "HandleOp", "request/response")
UpdateRelStyle(sar1, saj1, $textColor="black", $lineColor="#f44336")
BiRel(sar2, saj2, "HandleOp", "request/response")
UpdateRelStyle(sar2, saj2, $textColor="black", $lineColor="#555555")
}
UpdateLayoutConfig($c4ShapeInRow="1", $c4BoundaryInRow="2")
```

### Dynamic account Operations

Since dynamic account operation requests are less frequent, the Scale Agent Plugin processes them on demand as follows:

1. Receives and stores the new dynamic account operation in the `clouddriver.kubesvc_operation` table
2. Assigns the dynamic account operation in the `clouddriver.kubesvc_operation_multiple_assign` table, either to all connected Scale Agent instances or only to instances with the received zoneId
3. Notifies all instances to fetch pending dynamic account operations from the `clouddriver.kubesvc_operation_multiple_assign` table
4. Each instance reads its pending dynamic account operations and sends them to the Scale Agent
5. Waits for the response and sends it back
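
The assignment rule in step 2 can be sketched as a small filter. The function name and the shape of the `connected` mapping are hypothetical; the sketch only shows how an optional zoneId narrows the set of target instances.

```python
def assign_dynamic_operation(operation_id, connected, zone_id=None):
    """Build the (operation, instance) rows for a dynamic account operation.

    `connected` maps each connected instance id to its zone id. Without a
    `zone_id`, every connected instance receives the operation.
    """
    targets = [
        instance
        for instance, zone in connected.items()
        if zone_id is None or zone == zone_id
    ]
    return [(operation_id, instance) for instance in targets]


connected = {"agent-0": "zone-a", "agent-1": "zone-b", "agent-2": "zone-a"}

# No zoneId received: assign to all connected instances.
all_rows = assign_dynamic_operation("add-account-1", connected)
# zoneId received: assign only to the instances in that zone.
zone_rows = assign_dynamic_operation("add-account-2", connected, zone_id="zone-a")
```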

```mermaid
sequenceDiagram
actor User
participant Plugin
participant Service

User->>Plugin: Send dynamic account operation
Plugin->>Plugin: Store in clouddriver.kubesvc_operation
Plugin->>Plugin: Assign on clouddriver.kubesvc_operation_multiple_assign
Plugin->>Plugin: Notify all to read and send pending dynamic account operations
Plugin->>Service: gRPC HandleOp
Service-->>Plugin: return
Plugin->>Plugin: Store response in clouddriver.kubesvc_operation_history
Plugin-->>User: return
```
@@ -0,0 +1,51 @@
---
title: Enable and Configure Operations Horizontal Scaling in the Armory Scale Agent
linkTitle: Enable Operations Horizontal Scaling
description: >
  Learn how to enable and configure the Operations Horizontal Scaling feature in Armory Scale Agent for Spinnaker and Kubernetes.
---

## {{% heading "prereq" %}}

* You are familiar with {{< linkWithTitle "plugins/scale-agent/concepts/horizontal-scaling" >}}.

## Scale Agent plugin

> Operations Horizontal Scaling was introduced in plugin versions v0.13.20/0.12.21/0.11.56.

Enable Operations Horizontal Scaling by setting `kubesvc.cluster: database` in your plugin configuration. For example:

{{< highlight yaml "linenos=table,hl_lines=27-28">}}
spec:
  spinnakerConfig:
    profiles:
      clouddriver:
        spinnaker:
          extensibility:
            repositories:
              armory-agent-k8s-spinplug-releases:
                enabled: true
                url: https://raw.githubusercontent.com/armory-io/agent-k8s-spinplug-releases/master/repositories.json
            plugins:
              Armory.Kubesvc:
                enabled: true
                version: 0.13.20 # Replace with a version compatible with your Armory CD version
                extensions:
                  armory.kubesvc:
                    enabled: true
        # Plugin config
        kubesvc:
          cluster: database
          operations:
            database:
              scan:
                batchSize: <int> # (Optional) Requires kubesvc.cluster: database
                initialDelay: <int> # (Optional) Requires kubesvc.cluster: database
                maxDelay: <int> # (Optional) Requires kubesvc.cluster: database
{{< /highlight >}}

`operations.database.scan`:

* **batchSize**: (Optional) default: 5; the maximum number of operations that can be assigned to a Scale Agent instance per cycle
* **initialDelay**: (Optional) default: 250; milliseconds to wait per cycle when there are pending operations
* **maxDelay**: (Optional) default: 2000; milliseconds to wait per cycle when there are no pending operations
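
To get a feel for how these settings interact, the sketch below takes the descriptions at face value: a job waits `initialDelay` between cycles while operations are pending and `maxDelay` otherwise. Both helper functions are hypothetical, and whether the plugin moves gradually between the two delays is not stated here, so treat the numbers as a rough upper bound rather than measured behavior.

```python
def next_delay_ms(has_pending: bool, initial_delay: int = 250, max_delay: int = 2000) -> int:
    """Delay before the next cycle, reading the two settings literally."""
    return initial_delay if has_pending else max_delay


def max_ops_per_second(batch_size: int = 5, initial_delay: int = 250) -> float:
    """Upper bound on operations assigned per second to one instance.

    Ignores the time spent querying the database and sending operations.
    """
    return batch_size * 1000 / initial_delay


busy_delay = next_delay_ms(has_pending=True)    # 250 ms with the defaults
idle_delay = next_delay_ms(has_pending=False)   # 2000 ms with the defaults
default_rate = max_ops_per_second()             # 5 operations every 250 ms
tuned_rate = max_ops_per_second(batch_size=10, initial_delay=100)
```

With the defaults, each Scale Agent instance is assigned at most 5 operations every 250 ms, or about 20 per second. Raising `batchSize` or lowering `initialDelay` increases that bound at the cost of more frequent database scans.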
8 changes: 4 additions & 4 deletions static/csv/agent/agent-plugin-config-options.csv
@@ -8,7 +8,7 @@ Setting|Type|Default|Description
<code>kubesvc.cache.namespaceExpiryMinutes</code>|integer|0|Disabled by default, set it to a value greater than 0 to enable. Specifies minutes to keep namespace definitions in memory to reduce calls to the database.
<code>kubesvc.cache.onDemandQuickWaitMs</code>|integer|10000|How long to wait for a recache operation.
<code>kubesvc.cache.operationWaitMs</code>|integer|30000|How long to wait for a Kubernetes operation like deploy, scale, delete, or others
<code>kubesvc.cluster</code>|string|none|Type of clustering.<br><code>local</code>: for development only; don’t try to coordinate with other Clouddriver instances<br><code>redis</code>: use Redis to coordinate via pubsub. Redis will be deprecated in a future release.<br><span class='badge badge-primary'>0.10.24+</span><span class='badge badge-primary'>0.9.40</span><span class='badge badge-primary'>0.8.48</span> <code>kubernetes</code>:(Recommended) Requires additional <code>cluster-kubernetes</connected> configuration.
<code>kubesvc.cluster</code>|string|none|Type of clustering.<br><code>local</code>: for development only; don’t try to coordinate with other Clouddriver instances<br><code>redis</code>: use Redis to coordinate via pubsub. Redis will be deprecated in a future release.<br><span class='badge badge-primary'>0.10.24+</span><span class='badge badge-primary'>0.9.40</span><span class='badge badge-primary'>0.8.48</span> <code>kubernetes</code>:(Recommended) Requires additional <code>cluster-kubernetes</code> configuration.<br><span class='badge badge-primary'>0.13.19+</span><span class='badge badge-primary'>0.12.20+</span><span class='badge badge-primary'>0.11.56+</span> <code>database</code>: Uses the database as a queue to coordinate and to improve the distribution of operations. Can be tuned with the optional <code>kubesvc.operations.database.scan</code> settings.
<code>kubesvc.cluster-kubernetes.kubeconfigFile</code><br><code>kubesvc.cluster-kubernetes.verifySsl</code><br><code>kubesvc.cluster-kubernetes.namespace</code><br><code>kubesvc.cluster-kubernetes.httpPortName</code><br><code>kubesvc.cluster-kubernetes.clouddriverServiceNamePrefix</code>|string<br>boolean<br>string<br>string<br>string<br>|null<br>true<br>null<br>http<br>spin-clouddriver|(Optional) If configured, the plugin uses this file to discover Endpoints. If not configured, it will use the service account mounted to the pod.<br>(Optional) Whether to verify the Kubernetes API cert or not.<br>(Optional) If configured, the plugin watches Endpoints in this namespace. If null, it watches endpoints in the namespace indicated in the file <code>/var/run/secrets/kubernetes.io/serviceaccount/namespace</code><br>(Optional) Name of the port configured in clouddriver Service that forwards traffic to clouddriver http port for REST requests.<br>(Optional) Name prefix of the Kubernetes Service pointing to the Clouddriver standard HTTP port.
<code>kubesvc.credentials.poller.reloadFrequencyMs</code>|long|30000|<span class='badge badge-primary'>2.23.0+</span> <span class='badge badge-primary'>1.23.0+</span> How often the plugin will refresh account credentials to clouddriver in case <code>credentials.poller.enabled</code> is disabled. Otherwise the standard properties of <code>credentials.poller.enabled</code> and <code>credentials.poller.types.kubernetes.reloadFrequencyMs</code> are respected
<code>kubesvc.disableV2Provider</code>|boolean|false|If you don’t need the V2 provider account, set that to true to speed up caching deserialization.
@@ -41,6 +41,6 @@ Setting|Type|Default|Description
<code>kubesvc.v2-cache-eviction.batch-size</code>|integer|5|<span class='badge badge-primary'>0.10.3+</span> How many Kubernetes kinds to evict for each eviction event.
<code>kubesvc.v2-cache-eviction.millis</code>|integer|200|<span class='badge badge-primary'>0.10.3+</span> The time between evictions in milliseconds. Using a low value can lead to a spike in resource usage when migration occurs.
<code>kubesvc.ops.processTime.metric.result.maxLength</code>|integer|255|Maximum number of characters allowed in the <code>kubesvc.ops.processTime.result</code> metric attribute



<code>kubesvc.operations.database.scan.batchSize</code>|integer|5|The maximum number of operations that can be assigned to a Scale Agent instance per cycle
<code>kubesvc.operations.database.scan.initialDelay</code>|integer|250|Milliseconds to wait per cycle when there are pending operations
<code>kubesvc.operations.database.scan.maxDelay</code>|integer|2000|Milliseconds to wait per cycle when there are no pending operations