Use the term 'relevant' instead of 'significant' in try run results #972

Merged 1 commit on Aug 16, 2021
docs/glossary.md: 1 addition, 0 deletions
@@ -36,6 +36,7 @@ The following is a glossary of domain specific terminology. Although benchmarks
 * **test result delta**: the difference between two test results for the same metric and test case.
 * **significant test result delta**: a test result delta that is above some threshold that we have determined to be an actual change in performance and not noise.
 * **noisy test case**: a test case for which the non-significant test result deltas are on average above some threshold. Non-noisy data tends to be very close to zero, so a test case whose non-significant test result deltas are on average not close enough to zero is classified as noisy.
+* **relevant test result delta**: a synonym for *significant test result delta*, used in situations where the term "significant" might be ambiguous and readers may interpret *significant* as "large" or "statistically significant". For example, in try run results we use the term relevant instead of significant.
 * **highly variable test case**: a test case that frequently (over some threshold) produces significant test result deltas. This differs from sensitive benchmarks in that the deltas aren't necessarily large; they just happen more often than one would expect, and it is unclear why the significant deltas occur. Note: highly variable test cases can be classified as noisy. They differ from classically noisy benchmarks in that they often produce deltas above the significance threshold, so we cannot be 100% sure whether they are variable because they are noisy or because the parts of the compiler exercised by these benchmarks simply change very often.
 * **highly sensitive benchmark**: a test case that tends to produce large (over some threshold) significant test result deltas. This differs from highly variable test cases in that it is usually clear what type of changes tend to produce significant test result deltas. For example, some of the stress tests are highly sensitive; they produce significant test result deltas when the parts of the compiler they are stressing change, and the percentage change is typically quite large.
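To make these definitions concrete, here is a minimal sketch of classifying test result deltas against a significance threshold. The type, field names, and the 1% threshold are hypothetical illustrations, not rustc-perf's actual API:

```rust
/// Hypothetical, simplified model of the glossary terms above; the real
/// types live in site/src/comparison.rs and are more involved.
struct TestResultDelta {
    /// Relative change between two test results, e.g. 0.012 for +1.2%.
    relative_change: f64,
}

impl TestResultDelta {
    /// A delta is "significant" ("relevant", in try run results) when its
    /// absolute relative change exceeds the threshold; anything below is noise.
    fn is_significant(&self, significance_threshold: f64) -> bool {
        self.relative_change.abs() > significance_threshold
    }
}

fn main() {
    let threshold = 0.01; // made-up 1% threshold, purely illustrative
    for delta in [
        TestResultDelta { relative_change: 0.032 },  // a significant regression
        TestResultDelta { relative_change: -0.002 }, // noise
    ] {
        println!(
            "{:+.3} significant: {}",
            delta.relative_change,
            delta.is_significant(threshold)
        );
    }
}
```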

site/src/comparison.rs: 42 additions, 18 deletions
@@ -141,7 +141,7 @@ async fn populate_report(
 }
 
 pub struct ComparisonSummary {
-    /// Significant comparisons ordered by magnitude
+    /// Significant comparisons ordered by magnitude from largest to smallest
     comparisons: Vec<TestResultComparison>,
 }

@@ -168,8 +168,15 @@ impl ComparisonSummary {
         Some(ComparisonSummary { comparisons })
     }
 
+    /// Gets the overall direction and magnitude of the changes
+    ///
+    /// Returns `None` if there are no relevant changes.
+    pub fn direction_and_magnitude(&self) -> Option<(Direction, Magnitude)> {
+        self.direction().zip(self.magnitude())
+    }
+
     /// The direction of the changes
-    pub fn direction(&self) -> Option<Direction> {
+    fn direction(&self) -> Option<Direction> {
         if self.comparisons.len() == 0 {
             return None;
         }
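A brief aside on the combinator used above: `Option::zip` yields `Some` only when both sides are `Some`, which is why `direction_and_magnitude` returns `None` when there are no relevant changes. A standalone illustration (made-up values):

```rust
fn main() {
    let direction: Option<&str> = Some("regressions");
    let magnitude: Option<&str> = Some("moderate");
    // `zip` is `Some` only when both sides are `Some`.
    assert_eq!(direction.zip(magnitude), Some(("regressions", "moderate")));
    assert_eq!(direction.zip(None::<&str>), None);
}
```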
@@ -225,6 +232,16 @@ impl ComparisonSummary {
         }
     }
 
+    /// Get the largest magnitude of any change in the comparison.
+    ///
+    /// Returns `None` if there are no relevant changes.
+    fn magnitude(&self) -> Option<Magnitude> {
+        [self.largest_improvement(), self.largest_regression()]
+            .iter()
+            .filter_map(|c| c.map(|c| c.magnitude()))
+            .max_by(|c1, c2| c1.cmp(c2))
+    }
+
     pub fn relevant_changes<'a>(&'a self) -> [Option<&TestResultComparison>; 2] {
         match self.direction() {
             Some(Direction::Improvement) => [self.largest_improvement(), None],
@@ -234,15 +251,12 @@
         }
     }
 
-    pub fn largest_improvement(&self) -> Option<&TestResultComparison> {
-        self.comparisons
-            .iter()
-            .filter(|s| !s.is_regression())
-            .next()
+    fn largest_improvement(&self) -> Option<&TestResultComparison> {
+        self.comparisons.iter().find(|s| s.is_improvement())
     }
 
-    pub fn largest_regression(&self) -> Option<&TestResultComparison> {
-        self.comparisons.iter().filter(|s| s.is_regression()).next()
+    fn largest_regression(&self) -> Option<&TestResultComparison> {
+        self.comparisons.iter().find(|s| s.is_regression())
    }
 
     pub fn confidence(&self) -> ComparisonConfidence {
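The rewrite above swaps `filter(..).next()` for its idiomatic equivalent `find(..)`; because `comparisons` is ordered from largest to smallest, the first match is also the largest. A standalone sketch of that idiom (illustrative values only):

```rust
fn main() {
    // Deltas ordered by absolute magnitude, largest first (made-up numbers).
    let deltas = [5.2_f64, -3.1, 1.4, -0.2];
    // `find(..)` is equivalent to `filter(..).next()`: the first element that
    // matches the predicate, i.e. the largest regression given the ordering.
    let largest_regression = deltas.iter().find(|d| **d < 0.0);
    assert_eq!(largest_regression, Some(&-3.1));
}
```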
@@ -744,6 +758,10 @@ impl TestResultComparison {
         b > a
     }
 
+    fn is_improvement(&self) -> bool {
+        !self.is_regression()
+    }
+
     fn is_significant(&self) -> bool {
         self.relative_change().abs() > self.signifcance_threshold()
     }
@@ -798,7 +816,7 @@ impl TestResultComparison {
         write!(
             summary,
             "{} {} in {}",
-            magnitude,
+            magnitude.display_as_title(),
             self.direction(),
             match link {
                 Some(l) => format!("[instruction counts]({})", l),
@@ -853,8 +871,8 @@ impl std::fmt::Display for Direction {
 }
 
 /// The relative size of a performance change
-#[derive(Debug)]
-enum Magnitude {
+#[derive(Debug, PartialOrd, PartialEq, Ord, Eq)]
+pub enum Magnitude {
     Small,
     Medium,
     Large,
@@ -865,17 +883,23 @@ impl Magnitude {
     fn is_medium_or_above(&self) -> bool {
         !matches!(self, Self::Small)
     }
-}
 
-impl std::fmt::Display for Magnitude {
-    fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
-        let s = match self {
+    pub fn display_as_title(&self) -> &'static str {
+        match self {
             Self::Small => "Small",
             Self::Medium => "Moderate",
             Self::Large => "Large",
             Self::VeryLarge => "Very large",
-        };
-        f.write_str(s)
+        }
+    }
+
+    pub fn display(&self) -> &'static str {
+        match self {
+            Self::Small => "small",
+            Self::Medium => "moderate",
+            Self::Large => "large",
+            Self::VeryLarge => "very large",
+        }
     }
 }
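A note on the new derives: `Ord` derived on an enum orders variants in declaration order, so `Small < Medium < Large < VeryLarge`, which is exactly what the new `magnitude()` relies on when taking the maximum of the largest improvement and the largest regression. A standalone sketch (re-declaring a stand-in enum, not the code above):

```rust
#[derive(Debug, PartialOrd, PartialEq, Ord, Eq)]
enum Magnitude {
    Small,
    Medium,
    Large,
    VeryLarge,
}

fn main() {
    // Derived `Ord` follows declaration order: Small < Medium < Large < VeryLarge.
    let largest = [Magnitude::Medium, Magnitude::VeryLarge, Magnitude::Small]
        .into_iter()
        .max();
    assert_eq!(largest, Some(Magnitude::VeryLarge));
}
```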

site/src/github.rs: 18 additions, 16 deletions
@@ -605,29 +605,31 @@ async fn categorize_benchmark(
     };
     const DISAGREEMENT: &str = "If you disagree with this performance assessment, \
     please file an issue in [rust-lang/rustc-perf](https://github.com/rust-lang/rustc-perf/issues/new).";
-    let (summary, direction) = match ComparisonSummary::summarize_comparison(&comparison) {
-        Some(s) if s.confidence().is_atleast_probably_relevant() => {
-            let direction = s.direction().unwrap();
-            (s, direction)
-        }
-        _ => {
-            return (
-                format!(
-                    "This benchmark run did not return any significant changes.\n\n{}",
-                    DISAGREEMENT
-                ),
-                None,
-            )
-        }
-    };
+    let (summary, (direction, magnitude)) =
+        match ComparisonSummary::summarize_comparison(&comparison) {
+            Some(s) if s.confidence().is_atleast_probably_relevant() => {
+                let direction_and_magnitude = s.direction_and_magnitude().unwrap();
+                (s, direction_and_magnitude)
+            }
+            _ => {
+                return (
+                    format!(
+                        "This benchmark run did not return any relevant changes.\n\n{}",
+                        DISAGREEMENT
+                    ),
+                    None,
+                )
+            }
+        };
 
     let category = match direction {
         Direction::Improvement => "improvements 🎉",
         Direction::Regression => "regressions 😿",
         Direction::Mixed => "mixed results 🤷",
     };
     let mut result = format!(
-        "This change led to significant {} in compiler performance.\n",
+        "This change led to {} relevant {} in compiler performance.\n",
+        magnitude.display(),
         category
     );
     for change in summary.relevant_changes().iter().filter_map(|c| *c) {
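Putting the pieces together, here is a simplified standalone sketch of how the reworded try-run message is assembled. The enums below are stand-in re-declarations for illustration, not the actual rustc-perf modules:

```rust
// Simplified stand-ins for the types above; the real logic lives in
// site/src/github.rs and site/src/comparison.rs.
enum Direction {
    Improvement,
    Regression,
    Mixed,
}

enum Magnitude {
    Small,
    Medium,
    Large,
    VeryLarge,
}

impl Magnitude {
    fn display(&self) -> &'static str {
        match self {
            Self::Small => "small",
            Self::Medium => "moderate",
            Self::Large => "large",
            Self::VeryLarge => "very large",
        }
    }
}

fn main() {
    // Pretend the summary reported a moderate regression.
    let (direction, magnitude) = (Direction::Regression, Magnitude::Medium);
    let category = match direction {
        Direction::Improvement => "improvements 🎉",
        Direction::Regression => "regressions 😿",
        Direction::Mixed => "mixed results 🤷",
    };
    // Prints: This change led to moderate relevant regressions 😿 in compiler performance.
    println!(
        "This change led to {} relevant {} in compiler performance.",
        magnitude.display(),
        category
    );
}
```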
Expand Down