
Add bg task for collecting chicken switches from DB #8462

Open · wants to merge 4 commits into main
10 changes: 9 additions & 1 deletion dev-tools/omdb/src/bin/omdb/nexus/chicken_switches.rs
@@ -6,6 +6,7 @@

use crate::Omdb;
use crate::check_allow_destructive::DestructiveOperationToken;
use clap::ArgAction;
use clap::Args;
use clap::Subcommand;
use http::StatusCode;
@@ -33,6 +34,7 @@ pub enum ChickenSwitchesCommands {

#[derive(Debug, Clone, Args)]
pub struct ChickenSwitchesSetArgs {
#[clap(long, action=ArgAction::Set)]
planner_enabled: bool,
}
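
As an aside, here is a minimal sketch (not code from this PR) of why `ArgAction::Set` is used on a `bool` field: it turns the flag into a value-taking option, so callers can pass an explicit `--planner-enabled true` or `--planner-enabled false` instead of a presence-only switch.

```rust
// Illustrative sketch only; assumes clap 4 with the `derive` feature.
use clap::{ArgAction, Parser};

#[derive(Debug, Parser)]
struct SetArgs {
    /// With `ArgAction::Set`, the flag requires an explicit boolean value.
    #[clap(long, action = ArgAction::Set)]
    planner_enabled: bool,
}

fn main() {
    // e.g. `prog --planner-enabled true` or `prog --planner-enabled false`
    let args = SetArgs::parse();
    println!("planner_enabled = {}", args.planner_enabled);
}
```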

@@ -100,7 +102,13 @@ async fn chicken_switches_show(
println!(" modified time: {time_modified}");
println!(" planner enabled: {planner_enabled}");
}
Err(err) => eprintln!("error: {:#}", err),
Err(err) => {
if err.status() == Some(StatusCode::NOT_FOUND) {
println!("No chicken switches enabled");
} else {
eprintln!("error: {:#}", err)
}
}
}

Ok(())
12 changes: 12 additions & 0 deletions dev-tools/omdb/tests/env.out
@@ -56,6 +56,10 @@ task: "blueprint_rendezvous"
owned rendezvous tables that other subsystems consume


task: "chickens_switches_watcher"
watch db for chicken switch changes


task: "crdb_node_id_collector"
Collects node IDs of running CockroachDB zones

@@ -260,6 +264,10 @@ task: "blueprint_rendezvous"
owned rendezvous tables that other subsystems consume


task: "chickens_switches_watcher"
watch db for chicken switch changes


task: "crdb_node_id_collector"
Collects node IDs of running CockroachDB zones

@@ -451,6 +459,10 @@ task: "blueprint_rendezvous"
owned rendezvous tables that other subsystems consume


task: "chickens_switches_watcher"
watch db for chicken switch changes


task: "crdb_node_id_collector"
Collects node IDs of running CockroachDB zones

18 changes: 18 additions & 0 deletions dev-tools/omdb/tests/successes.out
@@ -268,6 +268,10 @@ task: "blueprint_rendezvous"
owned rendezvous tables that other subsystems consume


task: "chickens_switches_watcher"
watch db for chicken switch changes


task: "crdb_node_id_collector"
Collects node IDs of running CockroachDB zones

@@ -543,6 +547,13 @@ task: "blueprint_rendezvous"
started at <REDACTED_TIMESTAMP> (<REDACTED DURATION>s ago) and ran for <REDACTED DURATION>ms
last completion reported error: no blueprint

task: "chickens_switches_watcher"
configured period: every <REDACTED_DURATION>s
currently executing: no
last completed activation: <REDACTED ITERATIONS>, triggered by a periodic timer firing
started at <REDACTED_TIMESTAMP> (<REDACTED DURATION>s ago) and ran for <REDACTED DURATION>ms
warning: unknown background task: "chickens_switches_watcher" (don't know how to interpret details: Object {"chicken_switches_updated": Bool(false)})

task: "crdb_node_id_collector"
configured period: every <REDACTED_DURATION>m
currently executing: no
@@ -1083,6 +1094,13 @@ task: "blueprint_rendezvous"
started at <REDACTED_TIMESTAMP> (<REDACTED DURATION>s ago) and ran for <REDACTED DURATION>ms
last completion reported error: no blueprint

task: "chickens_switches_watcher"
configured period: every <REDACTED_DURATION>s
currently executing: no
last completed activation: <REDACTED ITERATIONS>, triggered by a periodic timer firing
started at <REDACTED_TIMESTAMP> (<REDACTED DURATION>s ago) and ran for <REDACTED DURATION>ms
warning: unknown background task: "chickens_switches_watcher" (don't know how to interpret details: Object {"chicken_switches_updated": Bool(false)})

task: "crdb_node_id_collector"
configured period: every <REDACTED_DURATION>m
currently executing: no
2 changes: 1 addition & 1 deletion docs/reconfigurator.adoc
@@ -175,7 +175,7 @@ We're being cautious about rolling out that kind of automation. Instead, today,

`omdb` uses the Nexus internal API to do these things. Since this can only be done using `omdb`, Reconfigurator can really only be used by Oxide engineering and support, not customers.

The planner background task is currently disabled by default, but can be enabled by setting the Nexus configuration option `blueprints.disable_planner = false`. To get to the long term vision where the system is doing all this on its own in response to operator input, we'll need to get confidence that continually executing the planner will have no ill effects on working systems. This might involve more operational experience with it, more safeties, and tools for pausing execution, previewing what it _would_ do, etc.
The planner background task is currently disabled by default, but can be enabled via `omdb nexus chicken-switches --planner-enabled`. To get to the long term vision where the system is doing all this on its own in response to operator input, we'll need to get confidence that continually executing the planner will have no ill effects on working systems. This might involve more operational experience with it, more safeties, and tools for pausing execution, previewing what it _would_ do, etc.

== Design patterns

28 changes: 22 additions & 6 deletions nexus-config/src/nexus_config.rs
@@ -441,6 +441,8 @@ pub struct BackgroundTaskConfig {
pub webhook_deliverator: WebhookDeliveratorConfig,
/// configuration for SP ereport ingester task
pub sp_ereport_ingester: SpEreportIngesterConfig,
/// reconfigurator runtime configuration
pub chicken_switches: ChickenSwitchesConfig,
Review comment (Contributor):
Small nit - should these live under BlueprintTasksConfig? So in the TOML, it'd be something like

blueprint.chicken_switches.loader_period_secs = N

or if we only expect this to ever really have the period, maybe eliminate the ChickenSwitchesConfig struct and make it

blueprints.chicken_switches_loader_period_secs = N

}

#[serde_as]
@@ -594,9 +596,6 @@ pub struct PhantomDiskConfig {
#[serde_as]
#[derive(Clone, Debug, Deserialize, Eq, PartialEq, Serialize)]
pub struct BlueprintTasksConfig {
/// background planner chicken switch
pub disable_planner: bool,

/// period (in seconds) for periodic activations of the background task that
/// reads the latest target blueprint from the database
#[serde_as(as = "DurationSeconds<u64>")]
@@ -827,6 +826,20 @@ impl Default for SpEreportIngesterConfig {
}
}

#[serde_as]
#[derive(Clone, Debug, Deserialize, Eq, PartialEq, Serialize)]
pub struct ChickenSwitchesConfig {
/// period (in seconds) for periodic activations of this background task
#[serde_as(as = "DurationSeconds<u64>")]
pub period_secs: Duration,
}

impl Default for ChickenSwitchesConfig {
fn default() -> Self {
Self { period_secs: Duration::from_secs(5) }
Review comment (Contributor):
Does this need a default? (If yes, why is it 5 here but 30 in the config files?)

}
}

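For illustration, a self-contained sketch (assuming the `serde_with` and `toml` crates; not the actual nexus-config plumbing) of how `DurationSeconds<u64>` maps the TOML line `chicken_switches.period_secs = 30` onto a `std::time::Duration`:

```rust
// Standalone sketch; not the real nexus-config types.
use serde::Deserialize;
use serde_with::{DurationSeconds, serde_as};
use std::time::Duration;

#[serde_as]
#[derive(Debug, Deserialize)]
struct ChickenSwitchesConfig {
    /// Written as an integer number of seconds, deserialized to `Duration`.
    #[serde_as(as = "DurationSeconds<u64>")]
    period_secs: Duration,
}

fn main() {
    let cfg: ChickenSwitchesConfig =
        toml::from_str("period_secs = 30").expect("valid TOML");
    assert_eq!(cfg.period_secs, Duration::from_secs(30));
}
```
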
/// Configuration for a nexus server
#[derive(Clone, Debug, Deserialize, PartialEq, Serialize)]
pub struct PackageConfig {
@@ -1079,7 +1092,6 @@ mod test {
physical_disk_adoption.period_secs = 30
decommissioned_disk_cleaner.period_secs = 30
phantom_disks.period_secs = 30
blueprints.disable_planner = true
blueprints.period_secs_load = 10
blueprints.period_secs_plan = 60
blueprints.period_secs_execute = 60
@@ -1111,6 +1123,7 @@
webhook_deliverator.first_retry_backoff_secs = 45
webhook_deliverator.second_retry_backoff_secs = 46
sp_ereport_ingester.period_secs = 47
chicken_switches.period_secs = 30
[default_region_allocation_strategy]
type = "random"
seed = 0
@@ -1247,7 +1260,6 @@
period_secs: Duration::from_secs(30),
},
blueprints: BlueprintTasksConfig {
disable_planner: true,
period_secs_load: Duration::from_secs(10),
period_secs_plan: Duration::from_secs(60),
period_secs_execute: Duration::from_secs(60),
@@ -1333,6 +1345,9 @@
sp_ereport_ingester: SpEreportIngesterConfig {
period_secs: Duration::from_secs(47),
},
chicken_switches: ChickenSwitchesConfig {
period_secs: Duration::from_secs(30)
}
},
default_region_allocation_strategy:
crate::nexus_config::RegionAllocationStrategy::Random {
@@ -1396,7 +1411,6 @@
physical_disk_adoption.period_secs = 30
decommissioned_disk_cleaner.period_secs = 30
phantom_disks.period_secs = 30
blueprints.disable_planner = true
blueprints.period_secs_load = 10
blueprints.period_secs_plan = 60
blueprints.period_secs_execute = 60
@@ -1424,6 +1438,8 @@
alert_dispatcher.period_secs = 42
webhook_deliverator.period_secs = 43
sp_ereport_ingester.period_secs = 44
chicken_switches.period_secs = 30

[default_region_allocation_strategy]
type = "random"
"##,
1 change: 1 addition & 0 deletions nexus/background-task-interface/src/init.rs
@@ -48,6 +48,7 @@ pub struct BackgroundTasks {
pub task_alert_dispatcher: Activator,
pub task_webhook_deliverator: Activator,
pub task_sp_ereport_ingester: Activator,
pub task_chicken_switches_collector: Activator,
Review comment (Contributor):
Tiny nit - maybe

Suggested change
pub task_chicken_switches_collector: Activator,
pub task_chicken_switches_loader: Activator,

I'm not sure we've been consistent with this, but I'd expect "_collector" to be a task that goes out and collects things (e.g., inventory), vs "_loader" is a task that reads stuff from the DB.


// Handles to activate background tasks that do not get used by Nexus
// at-large. These background tasks are implementation details as far as
2 changes: 1 addition & 1 deletion nexus/examples/config-second.toml
@@ -118,7 +118,6 @@ phantom_disks.period_secs = 30
physical_disk_adoption.period_secs = 30
support_bundle_collector.period_secs = 30
decommissioned_disk_cleaner.period_secs = 60
blueprints.disable_planner = true
blueprints.period_secs_load = 10
blueprints.period_secs_plan = 60
blueprints.period_secs_execute = 60
@@ -151,6 +150,7 @@ alert_dispatcher.period_secs = 60
webhook_deliverator.period_secs = 60
read_only_region_replacement_start.period_secs = 30
sp_ereport_ingester.period_secs = 30
chicken_switches.period_secs = 30

[default_region_allocation_strategy]
# allocate region on 3 random distinct zpools, on 3 random distinct sleds.
2 changes: 1 addition & 1 deletion nexus/examples/config.toml
@@ -104,7 +104,6 @@ phantom_disks.period_secs = 30
physical_disk_adoption.period_secs = 30
support_bundle_collector.period_secs = 30
decommissioned_disk_cleaner.period_secs = 60
blueprints.disable_planner = true
blueprints.period_secs_load = 10
blueprints.period_secs_plan = 60
blueprints.period_secs_execute = 60
@@ -137,6 +136,7 @@ alert_dispatcher.period_secs = 60
webhook_deliverator.period_secs = 60
read_only_region_replacement_start.period_secs = 30
sp_ereport_ingester.period_secs = 30
chicken_switches.period_secs = 30

[default_region_allocation_strategy]
# allocate region on 3 random distinct zpools, on 3 random distinct sleds.
19 changes: 18 additions & 1 deletion nexus/src/app/background/init.rs
@@ -96,6 +96,7 @@ use super::tasks::blueprint_execution;
use super::tasks::blueprint_load;
use super::tasks::blueprint_planner;
use super::tasks::blueprint_rendezvous;
use super::tasks::chicken_switches::ChickenSwitchesCollector;
use super::tasks::crdb_node_id_collector;
use super::tasks::decommissioned_disk_cleaner;
use super::tasks::dns_config;
@@ -230,6 +231,7 @@ impl BackgroundTasksInitializer {
task_alert_dispatcher: Activator::new(),
task_webhook_deliverator: Activator::new(),
task_sp_ereport_ingester: Activator::new(),
task_chicken_switches_collector: Activator::new(),

task_internal_dns_propagation: Activator::new(),
task_external_dns_propagation: Activator::new(),
@@ -306,6 +308,7 @@ impl BackgroundTasksInitializer {
task_alert_dispatcher,
task_webhook_deliverator,
task_sp_ereport_ingester,
task_chicken_switches_collector,
// Add new background tasks here. Be sure to use this binding in a
// call to `Driver::register()` below. That's what actually wires
// up the Activator to the corresponding background task.
@@ -476,13 +479,26 @@
inventory_watcher
};

let chicken_switches_collector =
ChickenSwitchesCollector::new(datastore.clone());
let chicken_switches_watcher = chicken_switches_collector.watcher();
driver.register(TaskDefinition {
name: "chickens_switches_watcher",
Review comment (Contributor):
Suggested change
name: "chickens_switches_watcher",
name: "chicken_switches_watcher",

description: "watch db for chicken switch changes",
period: config.chicken_switches.period_secs,
task_impl: Box::new(chicken_switches_collector),
opctx: opctx.child(BTreeMap::new()),
watchers: vec![],
activator: task_chicken_switches_collector,
});

// Background task: blueprint planner
//
// Replans on inventory collection and changes to the current
// target blueprint.
let blueprint_planner = blueprint_planner::BlueprintPlanner::new(
datastore.clone(),
config.blueprints.disable_planner,
chicken_switches_watcher.clone(),
inventory_watcher.clone(),
rx_blueprint.clone(),
);
Expand All @@ -496,6 +512,7 @@ impl BackgroundTasksInitializer {
watchers: vec![
Box::new(inventory_watcher.clone()),
Box::new(rx_blueprint.clone()),
Box::new(chicken_switches_watcher),
],
activator: task_blueprint_planner,
});
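
To make the wiring above easier to follow, here is a hand-wavy sketch (simplified types, not the actual Nexus `ReconfiguratorChickenSwitches` or task plumbing; assumes tokio with the relevant features) of the `tokio::sync::watch` pattern used here: the collector task publishes the latest switch values on the sender side, and the planner reads the most recent value with `borrow_and_update` each time it activates.

```rust
// Minimal sketch of the collector/planner watch-channel handoff.
use tokio::sync::watch;

#[derive(Clone, Debug)]
struct Switches {
    planner_enabled: bool,
}

#[tokio::main]
async fn main() {
    // The channel starts out empty, mirroring the Option<_> payload.
    let (tx, mut rx) = watch::channel::<Option<Switches>>(None);

    // "Collector" side: publish a freshly loaded value (hard-coded here).
    tx.send(Some(Switches { planner_enabled: true })).unwrap();

    // "Planner" side: read the latest value and mark it as seen.
    let enabled = rx
        .borrow_and_update()
        .as_ref()
        .is_some_and(|s| s.planner_enabled);
    assert!(enabled);
}
```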
27 changes: 22 additions & 5 deletions nexus/src/app/background/tasks/blueprint_planner.rs
@@ -12,6 +12,7 @@ use nexus_db_queries::context::OpContext;
use nexus_db_queries::db::DataStore;
use nexus_reconfigurator_planning::planner::Planner;
use nexus_reconfigurator_preparation::PlanningInputFromDb;
use nexus_types::deployment::ReconfiguratorChickenSwitches;
use nexus_types::deployment::{Blueprint, BlueprintTarget};
use nexus_types::internal_api::background::BlueprintPlannerStatus;
use omicron_common::api::external::LookupType;
@@ -24,7 +25,7 @@ use tokio::sync::watch::{self, Receiver, Sender};
/// Background task that runs the update planner.
pub struct BlueprintPlanner {
datastore: Arc<DataStore>,
disabled: bool,
rx_chicken_switches: Receiver<Option<ReconfiguratorChickenSwitches>>,
Review comment (Contributor):
I understand why this is an Option, but does it need to be? Presumably for every possible chicken switch, we have some default value (i.e., the thing we'd choose if this is None). Could we populate the channel with a ReconfiguratorChickenSwitches with those values set to avoid having to deal with None at all in this task?

Oh, as I write that I guess we don't have a default for version. How gross would it be to fill in a version: 0 for this "made up default" set of switches?

rx_inventory: Receiver<Option<CollectionUuid>>,
rx_blueprint: Receiver<Option<Arc<(BlueprintTarget, Blueprint)>>>,
tx_blueprint: Sender<Option<Arc<(BlueprintTarget, Blueprint)>>>,
@@ -33,12 +34,18 @@ pub struct BlueprintPlanner {
impl BlueprintPlanner {
pub fn new(
datastore: Arc<DataStore>,
disabled: bool,
rx_chicken_switches: Receiver<Option<ReconfiguratorChickenSwitches>>,
rx_inventory: Receiver<Option<CollectionUuid>>,
rx_blueprint: Receiver<Option<Arc<(BlueprintTarget, Blueprint)>>>,
) -> Self {
let (tx_blueprint, _) = watch::channel(None);
Self { datastore, disabled, rx_inventory, rx_blueprint, tx_blueprint }
Self {
datastore,
rx_chicken_switches,
rx_inventory,
rx_blueprint,
tx_blueprint,
}
}

pub fn watcher(
@@ -51,7 +58,8 @@ impl BlueprintPlanner {
/// If it is different from the current target blueprint,
/// save it and make it the current target.
pub async fn plan(&mut self, opctx: &OpContext) -> BlueprintPlannerStatus {
if self.disabled {
let switches = self.rx_chicken_switches.borrow_and_update().clone();
if switches.is_none_or(|s| !s.planner_enabled) {
debug!(&opctx.log, "blueprint planning disabled, doing nothing");
return BlueprintPlannerStatus::Disabled;
}
@@ -251,6 +259,7 @@ mod test {
use super::*;
use crate::app::background::tasks::blueprint_load::TargetBlueprintLoader;
use crate::app::background::tasks::inventory_collection::InventoryCollector;
use nexus_inventory::now_db_precision;
use nexus_test_utils_macros::nexus_test;

type ControlPlaneTestContext =
@@ -291,10 +300,18 @@
let rx_collector = collector.watcher();
collector.activate(&opctx).await;

// Enable the planner
let (_tx, chicken_switches_collector_rx) =
watch::channel(Some(ReconfiguratorChickenSwitches {
version: 1,
planner_enabled: true,
time_modified: now_db_precision(),
}));

// Finally, spin up the planner background task.
let mut planner = BlueprintPlanner::new(
datastore.clone(),
false,
chicken_switches_collector_rx,
rx_collector,
rx_loader.clone(),
);
Expand Down