Skip to content

Separate bot and human work in activity type reports#430

Open
brenton wants to merge 3 commits into
openshift-eng:mainfrom
brenton:bot-activity
Open

Separate bot and human work in activity type reports#430
brenton wants to merge 3 commits into
openshift-eng:mainfrom
brenton:bot-activity

Conversation

@brenton
Copy link
Copy Markdown
Contributor

@brenton brenton commented Apr 22, 2026

Bot-created Jira issues (e.g., ART image-build-failures) dominate activity type distributions, masking actual human engineering work. This adds bot detection via JIRA_LABEL_RHAI labels and separates the analysis throughout the pipeline.

SQL & skill definition (activity-type-report.md):

Add bot_issues CTE with 13 verified bot labels to Phase 3 SQL Deduplicate CTE in components query to prevent label list drift Update Phase 6 text summary to show separate human/bot distributions Update --sample description to document (project, is_bot) stratification classify_issues.py:

Add _get_is_bot() with proper string coercion (Snowflake returns string "false" which is truthy without coercion) Pass through is_bot field using _get_is_bot() instead of raw .get() Print bot/human split in console output
sample_and_estimate.py:

Stratify samples by (project, is_bot) instead of project alone Compute separate Bayesian estimates for human and bot subsets Add total==0 guard to prevent ZeroDivisionError on empty input generate_sankey.py:

Add Human Only / All / Bot Only toggle with three pre-rendered sankeys Add Source column (Human/Bot) to detail table with dropdown filter Generate per-view CI charts for sampling mode
Fix CI chart rendering: move D3 script loads before CI chart scripts Defer rendering of hidden sankeys/CI charts via data-pending attribute Rename misleading 'overall' variable to 'est_section' in generate_summary_stats() test_sample_and_estimate.py:

Add 27 tests covering _get_is_bot() type coercion, missing keys, and stratified_sample() stratum guarantees, proportionality, and determinism

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes

Special notes for your reviewer:

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.

Summary by CodeRabbit

  • New Features

    • Bot detection added; reports now offer Human / All / Bot toggle, separate distributions, counts, and CSV export labeling.
    • New "Source" column shows per-issue Human/Bot and filters/search respect the active view.
    • CLI: added documented usage variants combining --uncategorized with --todo and --all.
  • Bug Fixes / Improvements

    • Sampling and estimates now compute separate human vs bot sample budgets and credible intervals; overall view derived by weighted combination.
  • Documentation

    • Updated examples, JSON/output schema, and guidance for interpreting Human/All/Bot outputs.
  • Tests

    • Added tests covering sampling, sizing, weighting, and bot-flag parsing.

@openshift-ci openshift-ci Bot requested review from bentito and dgoodwin April 22, 2026 13:50
@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented Apr 22, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: brenton
Once this PR has been reviewed and has the lgtm label, please assign cblecker for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 22, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds bot detection and per-population stratified sampling/reporting to the Snowflake plugin: normalizes an is_bot flag, splits sampling and estimation into human vs bot strata with weighted overall estimates, updates UI/exports for Human/All/Bot views, expands tests, and bumps plugin version to 0.5.0.

Changes

Cohort / File(s) Summary
Bot detection & core sampling
plugins/snowflake/scripts/classify_issues.py, plugins/snowflake/scripts/sample_and_estimate.py
Added _get_is_bot() to normalize bot indicators; refactored sampling to independent human/bot draws (stratified_sample(human_n, bot_n)); added weighted_overall_estimate() and recommend_sample_sizes(); updated CLI semantics for --sample-size and output JSON to include human/bot blocks.
Reporting, UI, and export logic
plugins/snowflake/scripts/generate_sankey.py
Added _get_is_bot(); split data into human/bot partitions; parameterized chart functions (container_id, estimates_key); generate separate sankey/summary/CI per view; added Human/All/Bot toggle, deferred rendering when hidden, per-row is_bot/Source column, view-aware filtering/counts/CSV export.
Documentation
plugins/snowflake/commands/activity-type-report.md
Documented new CLI variants, Snowflake SQL bot_issues CTE and IS_BOT column, updated sampling strategy (separate human/bot sample sizing and post-stratified overall), updated output schema and UI/export field descriptions, and adjusted examples/expectations.
Tests
plugins/snowflake/scripts/test_sample_and_estimate.py
New comprehensive unit tests for _get_is_bot(), stratified_sample() behavior and determinism, recommend_sample_sizes(), and weighted_overall_estimate() weighting/output shape.
Manifests / metadata
.claude-plugin/marketplace.json, docs/data.json, plugins/snowflake/.claude-plugin/plugin.json
Bumped plugin version from 0.4.0 to 0.5.0 across manifests.

Sequence Diagrams

sequenceDiagram
    participant CLI as User/CLI
    participant Classify as Classify Service
    participant Sample as Sampler
    participant Estimator as Estimation Engine
    participant Sankey as Sankey/Report Generator
    participant Browser as Client HTML/JS

    CLI->>Classify: Read raw issues
    Classify->>Classify: _get_is_bot() → emit records with is_bot
    CLI->>Sample: submit issues + sample request
    Sample->>Sample: split into human_pool & bot_pool
    Sample->>Sample: recommend_sample_sizes() → human_n, bot_n
    Sample->>Sample: stratified_sample(human_n, bot_n)
    Sample->>Estimator: human_classified, bot_classified
    Estimator->>Estimator: weighted_overall_estimate() → human, bot, overall
    Estimator->>CLI: write estimates JSON
    CLI->>Sankey: data + estimates
    Sankey->>Sankey: generate views for Human / All / Bot
    Sankey->>Browser: emit HTML with toggle and deferred charts
    Browser->>Browser: user toggles view → render corresponding charts
Loading
sequenceDiagram
    participant UI as User (Toggle)
    participant JS as Report JS
    participant DOM as Container
    participant D3 as Renderer

    UI->>JS: Click Human / All / Bot tab
    JS->>JS: switchView(view)
    JS->>DOM: Show/hide view sections, set data-pending
    JS->>D3: If container visible (clientWidth>0) → render charts
    alt visible
        D3->>DOM: Render sankey & CI
        JS->>JS: Update counts & CSV content scoped to view
    else hidden
        JS->>JS: Defer rendering until visible
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 9 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 25.81% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (9 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title directly and accurately describes the main change: separating bot and human work in activity type reports, which is the core objective across all modified files.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
No Real People Names In Style References ✅ Passed Comprehensive search of all modified files found no instances of real people's names used as style references, example prompts, or documentation.
No Assumed Git Remote Names ✅ Passed The pull request does not introduce any hardcoded git remote names. Modified files contain only data processing scripts, documentation, and configuration files with no git operations.
Git Push Safety Rules ✅ Passed No git push commands, force push operations, or automated push workflows found in the pull request.
No Untrusted Mcp Servers ✅ Passed The PR modifies Python files for bot detection logic and documentation without introducing new MCP server installations from untrusted sources.
Ai-Helpers Overlap Detection ✅ Passed PR modifies only snowflake plugin with bot-detection analysis. While JIRA has similarly-named command, they operate on different data sources and serve complementary purposes. No significant overlap found.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
plugins/snowflake/scripts/sample_and_estimate.py (1)

353-380: ⚠️ Potential issue | 🟠 Major

Weight estimates by stratum before reporting sampled distributions.

The sampler intentionally over-represents small (project, is_bot) strata by guaranteeing at least one item, but estimate_distribution() treats the classified sample as if it were simple random. This can bias overall, human-only, and bot-only percentages toward rare strata.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/snowflake/scripts/sample_and_estimate.py` around lines 353 - 380, The
reported distributions are biased because small (project, is_bot) strata were
oversampled and estimate_distribution() is treating samples as simple random;
fix by weighting samples by their stratum sampling probabilities before calling
the estimator. For classified/human_classified/bot_classified, compute
per-stratum population counts (using human_total/bot_total and project-level
totals used in estimate_by_project) and per-stratum sample counts, derive a
sample weight for each record = population_count / sample_count for that stratum
(or equivalently inverse sampling probability), and pass those weights into the
estimator (either extend estimate_distribution to accept a weights argument or
call a new weighted estimator). Update the calls in this block (the overall
estimate_distribution call, the human_estimates and bot_estimates calls, and any
place using estimate_by_project that aggregates) to use the weighted version so
final percentages are unbiased for the stratified sampling.
plugins/snowflake/scripts/generate_sankey.py (1)

761-783: ⚠️ Potential issue | 🟡 Minor

Use the displayed Source labels for search and CSV export.

The table displays is_bot as Human/Bot, but global search sees raw booleans and CSV exports true or blank for the Source column. That makes the new Source column inconsistent with the UI.

🐛 Proposed helper
 function escapeHtml(s) {
   if (s == null) return "";
   return String(s).replace(/&/g,"&amp;").replace(/</g,"&lt;").replace(/>/g,"&gt;").replace(/"/g,"&quot;");
 }
+
+function displayValue(row, key) {
+  if (key === "is_bot") return row.is_bot ? "Bot" : "Human";
+  return row[key] == null ? "" : row[key];
+}
-        if (String(row[COLUMNS[i].key] || "").toLowerCase().indexOf(globalTerm) >= 0) {
+        if (String(displayValue(row, COLUMNS[i].key)).toLowerCase().indexOf(globalTerm) >= 0) {
-      var v = String(row[COLUMNS[j].key] || "").replace(/"/g, '""');
+      var v = String(displayValue(row, COLUMNS[j].key)).replace(/"/g, '""');

Also applies to: 1002-1010

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/snowflake/scripts/generate_sankey.py` around lines 761 - 783, The
global search and CSV export treat the Source column as raw booleans, causing
mismatches with the UI; update applyFilters (and the CSV generation code around
lines handling export, e.g., the export loop at ~1002-1010) to map row.is_bot to
the displayed labels ("Human" / "Bot") before matching or writing: in
applyFilters change globalTerm matching to compare against a computed display
value for is_bot (e.g., var sourceLabel = row.is_bot ? "Bot" : "Human") and
include that label in the per-column and global comparisons, adjust the
colFilters branch for col === "is_bot" to compare against the label instead of
boolean, and change CSV export to write sourceLabel instead of true/blank so
search and exports use the same displayed Source values.
🧹 Nitpick comments (1)
plugins/snowflake/scripts/classify_issues.py (1)

64-69: Extract bot-flag normalization into a shared helper.

_get_is_bot() is duplicated across the classification, sampling, and Sankey scripts. A shared utility would prevent drift if the accepted Snowflake/string representations change later. The provided context shows the same helper in plugins/snowflake/scripts/sample_and_estimate.py:49-56.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/snowflake/scripts/classify_issues.py` around lines 64 - 69, The
_get_is_bot helper logic is duplicated; extract the normalization into a single
shared utility (e.g., a new function normalize_is_bot or get_is_bot) placed in a
common module (such as plugins.snowflake.utils or plugins.snowflake.helpers),
move the existing implementation from _get_is_bot (used in classify_issues.py)
and the identical code in sample_and_estimate.py into that module, then replace
the local _get_is_bot implementations with imports calling the shared function
(update classify_issues._get_is_bot usage to call the new shared function and
remove the duplicate local definitions) so all scripts (classify_issues,
sample_and_estimate, and Sankey scripts) use the same normalized helper.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@plugins/snowflake/commands/activity-type-report.md`:
- Line 231: The docs claim a default sampling precision of ±2.5% but the
underlying script sample_and_estimate.py defaults to --target-width 0.10 (≈±5%);
either update the command examples in activity-type-report.md to pass
--target-width 0.05 when invoking sample_and_estimate.py (so the behavior
matches the ±2.5% statement) or change the documented default to ±5% and note
the script default --target-width=0.10; apply the same change for the other
matching occurrences in this file (the other example blocks that mention ±2.5%).
- Line 398: The docs/instructions assume presence of human.estimates[] and
bot.estimates[] when the human/bot keys exist, but estimates.json may include
"human" or "bot" keys set to null; update the text to explicitly check that
human and bot are not null before attempting to read human.estimates[] or
bot.estimates[] (e.g., read overall.estimates[] always, then if human != null
use human.estimates[], and if bot != null use bot.estimates[]), and clarify that
usage is read from $RUN_DIR/classified_sample_usage.txt (or
classified_issues_usage.txt in full mode) only when those separate distributions
exist.

In `@plugins/snowflake/scripts/classify_issues.py`:
- Line 319: There are two f-strings used without placeholders: locate the print
calls containing the exact literals "Bot/Human Split:" and "API Usage:" (the
print(f"\nBot/Human Split:") and print(f"\nAPI Usage:") calls) and remove the
unnecessary f prefix so they become plain string prints (e.g.,
print("\nBot/Human Split:") and print("\nAPI Usage:")) to satisfy Ruff F541.

In `@plugins/snowflake/scripts/generate_sankey.py`:
- Around line 156-160: The code currently falls back to top-level estimates and
mixes sources: compute and use values from the chosen sub-estimate (est_section)
consistently — e.g., set ci_pct = int(est_section.get("confidence", 0.95) * 100)
and read sample_size, total_population, sample_fraction from est_section rather
than estimates; do the same for the stat cards and the loop that iterates over
est_section.get("estimates", []) (and the later block around the 190-203 logic).
Also stop forcing a fallback to the top-level "overall" when a requested
sub-estimate is explicitly None: preserve None (or handle it as an empty
human/bot view) instead of substituting overall so human/bot views don't show
all-work stats.

In `@plugins/snowflake/scripts/sample_and_estimate.py`:
- Around line 82-87: The current per-stratum guaranteed allocation loop
(allocations, remaining, by_stratum) can exceed the requested sample size when n
< number of strata; change the loop so you never allocate more than remaining
slots: for each stratum in by_stratum, set allocations[stratum] = min(1,
len(stratum_issues), remaining) (or skip allocation if remaining == 0) then
decrement remaining and break when remaining == 0; this ensures the
one-per-stratum guarantee is capped by n and preserves correct
remaining/sample_fraction accounting.

---

Outside diff comments:
In `@plugins/snowflake/scripts/generate_sankey.py`:
- Around line 761-783: The global search and CSV export treat the Source column
as raw booleans, causing mismatches with the UI; update applyFilters (and the
CSV generation code around lines handling export, e.g., the export loop at
~1002-1010) to map row.is_bot to the displayed labels ("Human" / "Bot") before
matching or writing: in applyFilters change globalTerm matching to compare
against a computed display value for is_bot (e.g., var sourceLabel = row.is_bot
? "Bot" : "Human") and include that label in the per-column and global
comparisons, adjust the colFilters branch for col === "is_bot" to compare
against the label instead of boolean, and change CSV export to write sourceLabel
instead of true/blank so search and exports use the same displayed Source
values.

In `@plugins/snowflake/scripts/sample_and_estimate.py`:
- Around line 353-380: The reported distributions are biased because small
(project, is_bot) strata were oversampled and estimate_distribution() is
treating samples as simple random; fix by weighting samples by their stratum
sampling probabilities before calling the estimator. For
classified/human_classified/bot_classified, compute per-stratum population
counts (using human_total/bot_total and project-level totals used in
estimate_by_project) and per-stratum sample counts, derive a sample weight for
each record = population_count / sample_count for that stratum (or equivalently
inverse sampling probability), and pass those weights into the estimator (either
extend estimate_distribution to accept a weights argument or call a new weighted
estimator). Update the calls in this block (the overall estimate_distribution
call, the human_estimates and bot_estimates calls, and any place using
estimate_by_project that aggregates) to use the weighted version so final
percentages are unbiased for the stratified sampling.

---

Nitpick comments:
In `@plugins/snowflake/scripts/classify_issues.py`:
- Around line 64-69: The _get_is_bot helper logic is duplicated; extract the
normalization into a single shared utility (e.g., a new function
normalize_is_bot or get_is_bot) placed in a common module (such as
plugins.snowflake.utils or plugins.snowflake.helpers), move the existing
implementation from _get_is_bot (used in classify_issues.py) and the identical
code in sample_and_estimate.py into that module, then replace the local
_get_is_bot implementations with imports calling the shared function (update
classify_issues._get_is_bot usage to call the new shared function and remove the
duplicate local definitions) so all scripts (classify_issues,
sample_and_estimate, and Sankey scripts) use the same normalized helper.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 03e7886a-1573-473d-b232-9022d0ca0336

📥 Commits

Reviewing files that changed from the base of the PR and between ef00907 and 6212655.

📒 Files selected for processing (5)
  • plugins/snowflake/commands/activity-type-report.md
  • plugins/snowflake/scripts/classify_issues.py
  • plugins/snowflake/scripts/generate_sankey.py
  • plugins/snowflake/scripts/sample_and_estimate.py
  • plugins/snowflake/scripts/test_sample_and_estimate.py

Comment thread plugins/snowflake/commands/activity-type-report.md
Comment thread plugins/snowflake/commands/activity-type-report.md
Comment thread plugins/snowflake/scripts/classify_issues.py
Comment thread plugins/snowflake/scripts/generate_sankey.py
Comment thread plugins/snowflake/scripts/sample_and_estimate.py
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
plugins/snowflake/scripts/generate_sankey.py (1)

1016-1018: ⚠️ Potential issue | 🟡 Minor

Export the Source column as Human/Bot, not raw booleans.

With the new is_bot column, CSV currently writes true for bots and an empty string for humans because false || "" becomes "".

📤 Proposed CSV fix
     for (var j = 0; j < COLUMNS.length; j++) {
-      var v = String(row[COLUMNS[j].key] || "").replace(/"/g, '""');
+      var key = COLUMNS[j].key;
+      var raw = key === "is_bot" ? (row.is_bot ? "Bot" : "Human") : row[key];
+      var v = String(raw || "").replace(/"/g, '""');
       cells.push('"' + v + '"');
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/snowflake/scripts/generate_sankey.py` around lines 1016 - 1018, The
CSV writer currently uses `var v = String(row[COLUMNS[j].key] ||
"").replace(/"/g, '""');` which turns false into "" for the new boolean `is_bot`
column; change the assignment so that when COLUMNS[j].key (or COLUMNS[j].name if
your columns use names) equals 'is_bot' you map the boolean to the strings "Bot"
for true and "Human" for false, otherwise use the original String(value)
fallback and then apply .replace(/"/g,'""'); update the code inside the for-loop
that defines `v` (refer to the loop over `COLUMNS`, the `v` variable and the
`is_bot` key) to implement this mapping.
♻️ Duplicate comments (3)
plugins/snowflake/scripts/sample_and_estimate.py (1)

77-82: ⚠️ Potential issue | 🟠 Major

Cap the one-per-project guarantee by the remaining budget.

When budget is smaller than the number of projects, this still assigns one sample to every project, so the returned sample can exceed the requested budget.

🐛 Proposed allocation fix
     allocations = {}
     remaining = n
     for proj, issues in pool_by_project.items():
-        allocations[proj] = min(1, len(issues))
+        if remaining <= 0:
+            allocations[proj] = 0
+            continue
+        allocations[proj] = min(1, len(issues), remaining)
         remaining -= allocations[proj]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/snowflake/scripts/sample_and_estimate.py` around lines 77 - 82, The
current loop unconditionally gives each project one sample regardless of
remaining budget, causing the total to exceed n; modify the allocation in the
for-loop over pool_by_project so that allocations[proj] is min(1, len(issues),
remaining) (or allocate 1 only if remaining > 0), and only decrement remaining
when you actually allocate >0; reference variables/functions: allocations,
remaining, pool_by_project, n.
plugins/snowflake/commands/activity-type-report.md (1)

399-399: ⚠️ Potential issue | 🟡 Minor

Check human/bot for non-null before reading .estimates[].

estimates.json can include "human": null or "bot": null, so presence of the key alone is not enough before using human.estimates[] or bot.estimates[].

📝 Proposed wording
-Read the estimates from `$RUN_DIR/estimates.json`. For the overall distribution, use `overall.estimates[]` (each with `category`, `posterior_mean`, `ci_low`, `ci_high`). When `human` and `bot` keys are present in the JSON, use `human.estimates[]` and `bot.estimates[]` for the separate distributions. Read usage from `$RUN_DIR/classified_sample_usage.txt` (or `classified_issues_usage.txt` in full mode).
+Read the estimates from `$RUN_DIR/estimates.json`. For the overall distribution, use `overall.estimates[]` (each with `category`, `posterior_mean`, `ci_low`, `ci_high`). When `human` or `bot` is present and non-null in the JSON, use its `.estimates[]` for the separate distribution. Read usage from `$RUN_DIR/classified_sample_usage.txt` (or `classified_issues_usage.txt` in full mode).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/snowflake/commands/activity-type-report.md` at line 399, When parsing
estimates.json (reading overall.estimates, human.estimates, bot.estimates)
ensure you not only check for the existence of the "human" and "bot" keys but
also verify they are non-null before accessing their .estimates arrays; update
the parsing logic that reads $RUN_DIR/estimates.json to guard accesses like
human.estimates and bot.estimates with a null-check (e.g., if (human &&
human.estimates) / if (bot && bot.estimates)) and fall back to skipping or
treating as empty distributions when null, while still using overall.estimates
for the overall distribution and reading usage from classified_sample_usage.txt
or classified_issues_usage.txt as before.
plugins/snowflake/scripts/generate_sankey.py (1)

156-159: ⚠️ Potential issue | 🟠 Major

Do not fall back to overall for an explicitly missing Human/Bot estimate.

If estimates["human"] or estimates["bot"] is None, the selected view should show no sampled estimate rather than all-work estimates and top-level sample stats.

🐛 Proposed fix
         # Sampling mode: show posterior estimates with credible intervals
         est_section = estimates.get(estimates_key, estimates.get("overall", estimates))
         if est_section is None:
-            est_section = estimates.get("overall", estimates)
+            return '<p style="color:var(--text-muted);">No sampled estimate available for this view.</p>'
         ci_pct = int(estimates.get("confidence", 0.95) * 100)

Also applies to: 192-201

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/snowflake/scripts/generate_sankey.py` around lines 156 - 159, The
code currently replaces an explicit None estimate for a specific view
(est_section) with the top-level "overall" estimates; instead, detect whether
estimates_key exists in the estimates mapping and only fall back to
estimates["overall"] when the key is absent (not when the key exists but its
value is None). Update the selection logic around est_section/estimates_key in
generate_sankey.py (the block that sets est_section and the similar block at
lines ~192-201) to use a membership check (e.g., if estimates_key in estimates:
est_section = estimates[estimates_key] else: est_section =
estimates.get("overall")) so that explicit None values remain None and result in
no sampled estimate being shown.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@plugins/snowflake/commands/activity-type-report.md`:
- Line 311: The sample-output code fences for the Activity Type Report examples
use plain ``` fences; update each of those fences to specify the language "text"
so markdownlint MD040 is satisfied — specifically change the three-backtick
openings before the lines containing "Activity Type Report:
$RUN_DIR/activity-type-report.html" and the other two sample-output blocks (the
ones near the snippets that show sample output) from ``` to ```text; ensure all
matching closing fences remain ``` so the blocks render as plain text.

In `@plugins/snowflake/scripts/sample_and_estimate.py`:
- Around line 254-259: The current early-return logic treats an unsampled (but
non-empty) group as empty and causes weighted_overall_estimate() to ignore that
group's prior—fix by removing the simple truthy short-circuit and only skip
calling estimate_distribution when the group's true population size is zero;
instead always call estimate_distribution for both human_classified and
bot_classified unless their population/count field (e.g., total, size or N) is
explicitly zero, and then pass both estimates into weighted_overall_estimate();
update the conditions around human_classified and bot_classified so they check
the population-count attribute rather than truthiness before returning.

In `@plugins/snowflake/scripts/test_sample_and_estimate.py`:
- Line 205: The test calls stratified_sample returning "sample, meta" but "meta"
is unused; rename the unused unpacked variable to a prefixed name (e.g., _meta
or _) where stratified_sample(...) is assigned (the line with sample, meta =
stratified_sample(...)) so the linter (Ruff) treats it as intentionally unused;
update the variable name in that assignment only.

---

Outside diff comments:
In `@plugins/snowflake/scripts/generate_sankey.py`:
- Around line 1016-1018: The CSV writer currently uses `var v =
String(row[COLUMNS[j].key] || "").replace(/"/g, '""');` which turns false into
"" for the new boolean `is_bot` column; change the assignment so that when
COLUMNS[j].key (or COLUMNS[j].name if your columns use names) equals 'is_bot'
you map the boolean to the strings "Bot" for true and "Human" for false,
otherwise use the original String(value) fallback and then apply
.replace(/"/g,'""'); update the code inside the for-loop that defines `v` (refer
to the loop over `COLUMNS`, the `v` variable and the `is_bot` key) to implement
this mapping.

---

Duplicate comments:
In `@plugins/snowflake/commands/activity-type-report.md`:
- Line 399: When parsing estimates.json (reading overall.estimates,
human.estimates, bot.estimates) ensure you not only check for the existence of
the "human" and "bot" keys but also verify they are non-null before accessing
their .estimates arrays; update the parsing logic that reads
$RUN_DIR/estimates.json to guard accesses like human.estimates and bot.estimates
with a null-check (e.g., if (human && human.estimates) / if (bot &&
bot.estimates)) and fall back to skipping or treating as empty distributions
when null, while still using overall.estimates for the overall distribution and
reading usage from classified_sample_usage.txt or classified_issues_usage.txt as
before.

In `@plugins/snowflake/scripts/generate_sankey.py`:
- Around line 156-159: The code currently replaces an explicit None estimate for
a specific view (est_section) with the top-level "overall" estimates; instead,
detect whether estimates_key exists in the estimates mapping and only fall back
to estimates["overall"] when the key is absent (not when the key exists but its
value is None). Update the selection logic around est_section/estimates_key in
generate_sankey.py (the block that sets est_section and the similar block at
lines ~192-201) to use a membership check (e.g., if estimates_key in estimates:
est_section = estimates[estimates_key] else: est_section =
estimates.get("overall")) so that explicit None values remain None and result in
no sampled estimate being shown.

In `@plugins/snowflake/scripts/sample_and_estimate.py`:
- Around line 77-82: The current loop unconditionally gives each project one
sample regardless of remaining budget, causing the total to exceed n; modify the
allocation in the for-loop over pool_by_project so that allocations[proj] is
min(1, len(issues), remaining) (or allocate 1 only if remaining > 0), and only
decrement remaining when you actually allocate >0; reference
variables/functions: allocations, remaining, pool_by_project, n.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 264318d5-7f1a-47fa-97f4-b98e53f22543

📥 Commits

Reviewing files that changed from the base of the PR and between 6212655 and 56d8770.

📒 Files selected for processing (4)
  • plugins/snowflake/commands/activity-type-report.md
  • plugins/snowflake/scripts/generate_sankey.py
  • plugins/snowflake/scripts/sample_and_estimate.py
  • plugins/snowflake/scripts/test_sample_and_estimate.py

Comment thread plugins/snowflake/commands/activity-type-report.md
Comment thread plugins/snowflake/scripts/sample_and_estimate.py Outdated
Comment thread plugins/snowflake/scripts/test_sample_and_estimate.py Outdated
@brenton brenton force-pushed the bot-activity branch 2 times, most recently from d61caa0 to 8dd9537 Compare April 22, 2026 18:17
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
plugins/snowflake/scripts/generate_sankey.py (1)

154-205: ⚠️ Potential issue | 🟠 Major

Null human/bot sub-section can crash generate_summary_stats.

sample_and_estimate.py writes "human": null (or "bot": null) when the classified sample contains only the other population (see main() elif branches writing None). generate_html calls generate_summary_stats(..., estimates_key="human") whenever has_bots is true — which is triggered by bot_data alone, so a bots-only classified sample still takes that branch with human_data=[]. Then:

  • Line 156 estimates_key in estimates is True (the key is present).
  • Line 157 sets est_section = None.
  • Line 160 calls None.get(...)AttributeError.

The corresponding generate_ci_chart has if not sub or "estimates" not in sub: return "", but this function has no equivalent guard. The and est_section guard at line 193 only protects the stat-card branch, not the rows branch above.

🐛 Proposed fix
-        if estimates_key in estimates:
-            est_section = estimates[estimates_key]
-        else:
+        est_section = estimates.get(estimates_key)
+        if est_section is None:
             est_section = estimates.get("overall", estimates)
+        if not est_section or "estimates" not in est_section:
+            return '<p style="color:var(--text-muted);">No sampled estimate available for this view.</p>'
         ci_pct = int(est_section.get("confidence", estimates.get("confidence", 0.95)) * 100)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/snowflake/scripts/generate_sankey.py` around lines 154 - 205,
generate_summary_stats is crashing when estimates contains a key like "human" or
"bot" set to None; fix by treating a null sub-section as absent before using it:
change the logic that sets est_section (and the subsequent use of ci_pct and the
for-loop over est_section.get("estimates", [])) so it only uses
estimates[estimates_key] when that value is truthy and a dict (e.g., if
estimates_key in estimates and estimates[estimates_key] and "estimates" in
estimates[estimates_key] then est_section = estimates[estimates_key] else fall
back to estimates.get("overall", estimates)); also guard the ci_pct calculation
and iteration to use defaults when est_section is None (matching the guard
pattern in generate_ci_chart).
♻️ Duplicate comments (1)
plugins/snowflake/commands/activity-type-report.md (1)

399-399: ⚠️ Potential issue | 🟡 Minor

Still assumes human/bot keys are always non-null when present.

sample_and_estimate.py emits "human": null (or "bot": null) when the classified sample contains only one population. As written, this instruction would direct the agent to read human.estimates[] off null. Suggest clarifying "present and non-null".

📝 Proposed wording
-Read the estimates from `$RUN_DIR/estimates.json`. For the overall distribution, use `overall.estimates[]` (each with `category`, `posterior_mean`, `ci_low`, `ci_high`). When `human` and `bot` keys are present in the JSON, use `human.estimates[]` and `bot.estimates[]` for the separate distributions. Read usage from `$RUN_DIR/classified_sample_usage.txt` (or `classified_issues_usage.txt` in full mode).
+Read the estimates from `$RUN_DIR/estimates.json`. For the overall distribution, use `overall.estimates[]` (each with `category`, `posterior_mean`, `ci_low`, `ci_high`). When `human` and `bot` are present **and non-null**, use `human.estimates[]` and `bot.estimates[]` for the separate distributions. Read usage from `$RUN_DIR/classified_sample_usage.txt` (or `classified_issues_usage.txt` in full mode).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/snowflake/commands/activity-type-report.md` at line 399, The
docs/instruction assume "human" and "bot" keys are always non-null; ensure the
agent checks that the JSON keys "human" and "bot" both exist and are non-null
before attempting to read human.estimates[] or bot.estimates[] (otherwise fall
back to overall.estimates[]), and reference sample_and_estimate.py behavior
where it may emit "human": null or "bot": null; also keep reading usage from
classified_sample_usage.txt (or classified_issues_usage.txt in full mode) as
before.
🧹 Nitpick comments (4)
plugins/snowflake/scripts/sample_and_estimate.py (4)

53-58: Consider extracting _get_is_bot to a shared module.

This helper is now copy-pasted across classify_issues.py, generate_sankey.py, and this file. If the set of accepted truthy strings ever changes (e.g., "y", "t"), the definitions will drift. A small shared utility module (bot_detection.py or similar) would keep them in lockstep.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/snowflake/scripts/sample_and_estimate.py` around lines 53 - 58,
Extract the duplicated _get_is_bot logic into a new shared utility module (e.g.,
bot_detection.py) and replace the local copies in _get_is_bot (in this file) and
the same helper in classify_issues.py and generate_sankey.py with imports from
that module; move the current logic into a single function (e.g., is_bot(value)
or normalize_is_bot(value)) that accepts the raw value, handles string truthy
forms ("true","1","yes" and any future additions), and returns a boolean, then
update callers to call the shared function instead of their local _get_is_bot.

419-422: Reuse already-computed population counts.

human_pop and bot_pop were computed at lines 330-331 via recommend_sample_sizes() and are in scope here; recomputing human_total/bot_total duplicates work.

♻️ Proposed diff
-        human_classified = [i for i in classified if not _get_is_bot(i)]
-        bot_classified = [i for i in classified if _get_is_bot(i)]
-        human_total = sum(1 for i in all_issues if not _get_is_bot(i))
-        bot_total = total - human_total
+        human_classified = [i for i in classified if not _get_is_bot(i)]
+        bot_classified = [i for i in classified if _get_is_bot(i)]
+        human_total = human_pop
+        bot_total = bot_pop
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/snowflake/scripts/sample_and_estimate.py` around lines 419 - 422, The
code recomputes population counts (human_total and bot_total) even though
recommend_sample_sizes() already computed human_pop and bot_pop in scope;
replace the recomputation by reusing those variables: keep the filtered lists
human_classified and bot_classified but set human_total = human_pop and
bot_total = bot_pop (or remove human_total/bot_total entirely where
appropriate), updating any downstream uses to reference the existing
human_pop/bot_pop to avoid duplicate work and potential inconsistency.

61-96: Function name stratified_sample no longer matches behavior.

The new implementation does two-stratum (human/bot) simple random sampling — the project-level stratification has been removed (intentionally, per commit messages). The module docstring at line 7 still says "stratified samples" and the command docs (activity-type-report.md line 423) say "Within each population, stratifies by project". Consider renaming to e.g. sample_human_bot() or updating the docstrings/docs so the vocabulary matches the implementation, to avoid confusing future readers who expect per-project stratification.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/snowflake/scripts/sample_and_estimate.py` around lines 61 - 96, The
function stratified_sample currently performs two-stratum simple random sampling
(human vs bot) but its name and surrounding docs still claim "stratified" /
per-project stratification; either rename the function to a clear name like
sample_human_bot (and update all call sites) or update the module/docstrings and
external docs (e.g., activity-type-report.md) to state that sampling is by
human/bot only with no per-project stratification; update the function
docstring, any references to stratified behavior, and tests/examples to use the
new name or corrected description so implementation and docs match (refer to
stratified_sample, _get_is_bot, and activity-type-report.md).

195-204: Minor: pass strict=False to zip() for clarity.

Both zip calls iterate lists of known-equal length, but being explicit documents intent and silences ruff B905. The outer zip(human_samples, bot_samples) is particularly worth tagging — a future change to dirichlet_sample that returns a truncated result would silently produce combined estimates from a short MC run rather than erroring.

♻️ Proposed diff
-    for h_draw, b_draw in zip(human_samples, bot_samples):
-        weighted = [w_human * h + w_bot * b for h, b in zip(h_draw, b_draw)]
+    for h_draw, b_draw in zip(human_samples, bot_samples, strict=False):
+        weighted = [w_human * h + w_bot * b for h, b in zip(h_draw, b_draw, strict=False)]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/snowflake/scripts/sample_and_estimate.py` around lines 195 - 204, The
zip usage in the sampling and weighting loop can silently truncate if lengths
diverge; update the zip calls to explicitly pass strict=False so mismatched
lengths raise immediately: change the outer zip(human_samples, bot_samples) used
when building weighted_samples and the inner zip(h_draw, b_draw) inside the list
comprehension to zip(..., strict=False), referencing rng, human_samples,
bot_samples, dirichlet_sample, and weighted_samples to locate the sites to
modify.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@plugins/snowflake/commands/activity-type-report.md`:
- Line 423: The docs mistakenly claim per-project stratification while the
implementation (stratified_sample()) only stratifies by is_bot and performs
random sampling within human/bot pools; update the two occurrences that mention
project-level stratification (the sentence "Within each population, stratifies
by project to ensure all projects are represented." and "Stratifies by (project,
is_bot) to ensure both human and bot populations are represented.") to reflect
the actual behavior (e.g., "Stratifies by is_bot and performs random sampling
within each human/bot population" or similar), or alternatively reintroduce
project-level stratification in the stratified_sample() implementation if you
intend to keep the current wording. Ensure the markdown text and any related
examples describe only the implemented stratification logic and mention that
sampling within each pool is simple random sampling.

In `@plugins/snowflake/scripts/generate_sankey.py`:
- Around line 709-711: The currentViewFilter initialization assumes humans exist
when HAS_BOTS is true and can open the report on an empty population; change the
logic that sets currentViewFilter (currently using HAS_BOTS and the variable
currentViewFilter) to pick whichever side has data by checking counts (e.g.,
inspect human_data.length and bot_data.length or compute counts from TABLE_DATA)
and default to the non-empty view (or accept an explicit initial view passed
from Python); update references to TABLE_DATA and currentViewFilter so the UI
starts on the populated "human" or "bot" view instead of "human" when the human
set is empty.

---

Outside diff comments:
In `@plugins/snowflake/scripts/generate_sankey.py`:
- Around line 154-205: generate_summary_stats is crashing when estimates
contains a key like "human" or "bot" set to None; fix by treating a null
sub-section as absent before using it: change the logic that sets est_section
(and the subsequent use of ci_pct and the for-loop over
est_section.get("estimates", [])) so it only uses estimates[estimates_key] when
that value is truthy and a dict (e.g., if estimates_key in estimates and
estimates[estimates_key] and "estimates" in estimates[estimates_key] then
est_section = estimates[estimates_key] else fall back to
estimates.get("overall", estimates)); also guard the ci_pct calculation and
iteration to use defaults when est_section is None (matching the guard pattern
in generate_ci_chart).

---

Duplicate comments:
In `@plugins/snowflake/commands/activity-type-report.md`:
- Line 399: The docs/instruction assume "human" and "bot" keys are always
non-null; ensure the agent checks that the JSON keys "human" and "bot" both
exist and are non-null before attempting to read human.estimates[] or
bot.estimates[] (otherwise fall back to overall.estimates[]), and reference
sample_and_estimate.py behavior where it may emit "human": null or "bot": null;
also keep reading usage from classified_sample_usage.txt (or
classified_issues_usage.txt in full mode) as before.

---

Nitpick comments:
In `@plugins/snowflake/scripts/sample_and_estimate.py`:
- Around line 53-58: Extract the duplicated _get_is_bot logic into a new shared
utility module (e.g., bot_detection.py) and replace the local copies in
_get_is_bot (in this file) and the same helper in classify_issues.py and
generate_sankey.py with imports from that module; move the current logic into a
single function (e.g., is_bot(value) or normalize_is_bot(value)) that accepts
the raw value, handles string truthy forms ("true","1","yes" and any future
additions), and returns a boolean, then update callers to call the shared
function instead of their local _get_is_bot.
- Around line 419-422: The code recomputes population counts (human_total and
bot_total) even though recommend_sample_sizes() already computed human_pop and
bot_pop in scope; replace the recomputation by reusing those variables: keep the
filtered lists human_classified and bot_classified but set human_total =
human_pop and bot_total = bot_pop (or remove human_total/bot_total entirely
where appropriate), updating any downstream uses to reference the existing
human_pop/bot_pop to avoid duplicate work and potential inconsistency.
- Around line 61-96: The function stratified_sample currently performs
two-stratum simple random sampling (human vs bot) but its name and surrounding
docs still claim "stratified" / per-project stratification; either rename the
function to a clear name like sample_human_bot (and update all call sites) or
update the module/docstrings and external docs (e.g., activity-type-report.md)
to state that sampling is by human/bot only with no per-project stratification;
update the function docstring, any references to stratified behavior, and
tests/examples to use the new name or corrected description so implementation
and docs match (refer to stratified_sample, _get_is_bot, and
activity-type-report.md).
- Around line 195-204: The zip usage in the sampling and weighting loop can
silently truncate if lengths diverge; update the zip calls to explicitly pass
strict=False so mismatched lengths raise immediately: change the outer
zip(human_samples, bot_samples) used when building weighted_samples and the
inner zip(h_draw, b_draw) inside the list comprehension to zip(...,
strict=False), referencing rng, human_samples, bot_samples, dirichlet_sample,
and weighted_samples to locate the sites to modify.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 8d14053a-99d6-4d68-ac28-c909826ab888

📥 Commits

Reviewing files that changed from the base of the PR and between 5b6cea3 and deb2a93.

📒 Files selected for processing (8)
  • .claude-plugin/marketplace.json
  • docs/data.json
  • plugins/snowflake/.claude-plugin/plugin.json
  • plugins/snowflake/commands/activity-type-report.md
  • plugins/snowflake/scripts/classify_issues.py
  • plugins/snowflake/scripts/generate_sankey.py
  • plugins/snowflake/scripts/sample_and_estimate.py
  • plugins/snowflake/scripts/test_sample_and_estimate.py
✅ Files skipped from review due to trivial changes (4)
  • plugins/snowflake/.claude-plugin/plugin.json
  • docs/data.json
  • .claude-plugin/marketplace.json
  • plugins/snowflake/scripts/test_sample_and_estimate.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • plugins/snowflake/scripts/classify_issues.py

Comment thread plugins/snowflake/commands/activity-type-report.md Outdated
Comment on lines 709 to +711
var filteredData = TABLE_DATA.slice();
var colFilters = {};
var currentViewFilter = HAS_BOTS ? "human" : "all";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Default view can land on an empty population.

currentViewFilter = HAS_BOTS ? "human" : "all" assumes humans exist whenever bots do. For a bots-only dataset (has_bots=True, human_data=[]), the report opens on "Human Only (0)" with an empty sankey/summary. Consider defaulting to the populated view when one side is empty (e.g., choose based on human_data.length / bot_data.length, or pass the initial view from Python).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/snowflake/scripts/generate_sankey.py` around lines 709 - 711, The
currentViewFilter initialization assumes humans exist when HAS_BOTS is true and
can open the report on an empty population; change the logic that sets
currentViewFilter (currently using HAS_BOTS and the variable currentViewFilter)
to pick whichever side has data by checking counts (e.g., inspect
human_data.length and bot_data.length or compute counts from TABLE_DATA) and
default to the non-empty view (or accept an explicit initial view passed from
Python); update references to TABLE_DATA and currentViewFilter so the UI starts
on the populated "human" or "bot" view instead of "human" when the human set is
empty.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
plugins/snowflake/scripts/generate_sankey.py (1)

44-49: Optional: extract the duplicated _get_is_bot helper into a shared module.

The exact same _get_is_bot is now defined in both sample_and_estimate.py and generate_sankey.py. If a new bot-flag representation appears (e.g. integer 1, JSON null), only one copy will be updated. Consider promoting it to a tiny shared module under plugins/snowflake/scripts/ and importing it from both places.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/snowflake/scripts/generate_sankey.py` around lines 44 - 49, Extract
the duplicated helper _get_is_bot into a small shared module (e.g.,
plugins/snowflake/scripts/bot_utils.py), move the current implementation there,
and update both generate_sankey.py and sample_and_estimate.py to import
_get_is_bot from that module; ensure the moved function preserves existing
behavior (handling strings like "true"/"1"/"yes" and non-string truthiness) and
run tests to confirm nothing else breaks.
plugins/snowflake/scripts/sample_and_estimate.py (2)

379-383: Minor: empty else "" branch prints a blank line.

When human_pop == 0 (all-bot dataset), the conditional expression evaluates to "", but print("") still emits a blank line. Cleaner to wrap the whole print in an if:

🧹 Proposed change
-        print(f"  Human: {h_sampled} of {human_pop} "
-              f"({h_sampled/human_pop*100:.1f}%)" if human_pop > 0 else "")
-        if bot_pop > 0:
+        if human_pop > 0:
+            print(f"  Human: {h_sampled} of {human_pop} "
+                  f"({h_sampled/human_pop*100:.1f}%)")
+        if bot_pop > 0:
             print(f"  Bot:   {b_sampled} of {bot_pop} "
                   f"({b_sampled/bot_pop*100:.1f}%)")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/snowflake/scripts/sample_and_estimate.py` around lines 379 - 383, The
print for human sample uses a conditional expression that yields an empty string
and still calls print, producing a blank line when human_pop == 0; change this
by wrapping the human print in an explicit if check: replace the
conditional-expression print that references h_sampled and human_pop with an if
human_pop > 0: print(...) block so no blank line is emitted (leave the existing
bot print using bot_pop and b_sampled unchanged).

434-466: Optional: factor out the per-population estimate construction to reduce duplication.

The three branches (both, only human, only bot) repeat the same dict-construction shape three times. A small helper would eliminate the duplication and make future schema changes easier:

♻️ Suggested refactor
-        human_estimates = None
-        bot_estimates = None
-        if human_classified and bot_classified:
-            human_estimates = {
-                "population": human_total,
-                "sample_size": len(human_classified),
-                **estimate_distribution(
-                    human_classified, confidence=args.confidence, seed=args.seed
-                ),
-            }
-            bot_estimates = {
-                "population": bot_total,
-                "sample_size": len(bot_classified),
-                **estimate_distribution(
-                    bot_classified, confidence=args.confidence, seed=args.seed
-                ),
-            }
-        elif human_classified:
-            human_estimates = {
-                "population": human_total,
-                "sample_size": len(human_classified),
-                **estimate_distribution(
-                    human_classified, confidence=args.confidence, seed=args.seed
-                ),
-            }
-        elif bot_classified:
-            bot_estimates = {
-                "population": bot_total,
-                "sample_size": len(bot_classified),
-                **estimate_distribution(
-                    bot_classified, confidence=args.confidence, seed=args.seed
-                ),
-            }
+        def _pop_estimate(classified, population):
+            if not classified:
+                return None
+            return {
+                "population": population,
+                "sample_size": len(classified),
+                **estimate_distribution(
+                    classified, confidence=args.confidence, seed=args.seed
+                ),
+            }
+
+        human_estimates = _pop_estimate(human_classified, human_total)
+        bot_estimates = _pop_estimate(bot_classified, bot_total)

Note that this also subtly changes behavior: with the current code, in the "only human" / "only bot" branches the missing side stays None, but here both are computed independently — same end result, fewer branches.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/snowflake/scripts/sample_and_estimate.py` around lines 434 - 466, The
code repeats identical dict construction for human_estimates and bot_estimates
across three branches; create a small helper (e.g.,
build_estimate(population_total, classified_list)) that calls
estimate_distribution(classified_list, confidence=args.confidence,
seed=args.seed) and returns {"population": population_total, "sample_size":
len(classified_list), **distribution}, then replace the duplicated blocks that
construct human_estimates and bot_estimates using that helper while preserving
current semantics (leave the other side as None when human_classified or
bot_classified is absent); reference variables/functions: human_estimates,
bot_estimates, human_classified, bot_classified, human_total, bot_total,
estimate_distribution, args.confidence, args.seed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@plugins/snowflake/scripts/generate_sankey.py`:
- Around line 270-274: The current conditional mixes null-check and width-check
causing a null-deref when document.getElementById('{container_id}') returns
null; change the logic to guard the null case first: call
document.getElementById('{container_id}') into el, if (!el) return (or handle
absence), then separately check if (el.clientWidth === 0) and in that branch
call el.setAttribute('data-pending','true') and return; update the code around
the getElementById('{container_id}') / el usage so getElementById, el, and
setAttribute are referenced in those distinct guarded branches.

---

Nitpick comments:
In `@plugins/snowflake/scripts/generate_sankey.py`:
- Around line 44-49: Extract the duplicated helper _get_is_bot into a small
shared module (e.g., plugins/snowflake/scripts/bot_utils.py), move the current
implementation there, and update both generate_sankey.py and
sample_and_estimate.py to import _get_is_bot from that module; ensure the moved
function preserves existing behavior (handling strings like "true"/"1"/"yes" and
non-string truthiness) and run tests to confirm nothing else breaks.

In `@plugins/snowflake/scripts/sample_and_estimate.py`:
- Around line 379-383: The print for human sample uses a conditional expression
that yields an empty string and still calls print, producing a blank line when
human_pop == 0; change this by wrapping the human print in an explicit if check:
replace the conditional-expression print that references h_sampled and human_pop
with an if human_pop > 0: print(...) block so no blank line is emitted (leave
the existing bot print using bot_pop and b_sampled unchanged).
- Around line 434-466: The code repeats identical dict construction for
human_estimates and bot_estimates across three branches; create a small helper
(e.g., build_estimate(population_total, classified_list)) that calls
estimate_distribution(classified_list, confidence=args.confidence,
seed=args.seed) and returns {"population": population_total, "sample_size":
len(classified_list), **distribution}, then replace the duplicated blocks that
construct human_estimates and bot_estimates using that helper while preserving
current semantics (leave the other side as None when human_classified or
bot_classified is absent); reference variables/functions: human_estimates,
bot_estimates, human_classified, bot_classified, human_total, bot_total,
estimate_distribution, args.confidence, args.seed.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 14d7deff-c487-4ee4-a8f5-7eb4dd2f12b7

📥 Commits

Reviewing files that changed from the base of the PR and between deb2a93 and 46f2767.

📒 Files selected for processing (3)
  • plugins/snowflake/commands/activity-type-report.md
  • plugins/snowflake/scripts/generate_sankey.py
  • plugins/snowflake/scripts/sample_and_estimate.py

Comment on lines +270 to +274
var el = document.getElementById('{container_id}');
if (!el || el.clientWidth === 0) {{
el.setAttribute('data-pending', 'true');
return;
}}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Null-dereference when the CI chart container is missing.

If document.getElementById('{container_id}') returns null, the !el || el.clientWidth === 0 short-circuits to truthy and then immediately calls el.setAttribute(...), which throws TypeError: Cannot read properties of null. Guard the two cases separately:

🐛 Proposed fix
       var el = document.getElementById('{container_id}');
-      if (!el || el.clientWidth === 0) {{
-        el.setAttribute('data-pending', 'true');
-        return;
-      }}
+      if (!el) return;
+      if (el.clientWidth === 0) {{
+        el.setAttribute('data-pending', 'true');
+        return;
+      }}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
var el = document.getElementById('{container_id}');
if (!el || el.clientWidth === 0) {{
el.setAttribute('data-pending', 'true');
return;
}}
var el = document.getElementById('{container_id}');
if (!el) return;
if (el.clientWidth === 0) {{
el.setAttribute('data-pending', 'true');
return;
}}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/snowflake/scripts/generate_sankey.py` around lines 270 - 274, The
current conditional mixes null-check and width-check causing a null-deref when
document.getElementById('{container_id}') returns null; change the logic to
guard the null case first: call document.getElementById('{container_id}') into
el, if (!el) return (or handle absence), then separately check if
(el.clientWidth === 0) and in that branch call
el.setAttribute('data-pending','true') and return; update the code around the
getElementById('{container_id}') / el usage so getElementById, el, and
setAttribute are referenced in those distinct guarded branches.

@brenton
Copy link
Copy Markdown
Contributor Author

brenton commented Apr 23, 2026

/retest-required

@brenton
Copy link
Copy Markdown
Contributor Author

brenton commented Apr 24, 2026

/test images

brenton and others added 2 commits May 11, 2026 16:46
Bot-created Jira issues (e.g., ART image-build-failures) dominate activity type distributions, masking actual human engineering work. This adds bot detection via JIRA_LABEL_RHAI labels and separates the analysis.

Changes across 4 files in the snowflake plugin:

activity-type-report.md: Added bot_issues CTE to Phase 3 SQL using 13 verified bot labels (auto-created, art:image-build-failure, ai-generated-jira, etc.) identified across 48 HP projects. Updated Phase 6 text summary to show separate human/bot distributions when bots are detected.

classify_issues.py: Pass through IS_BOT field from Snowflake data to classified output. Print bot/human split in console summary.

sample_and_estimate.py: Stratify samples by (project, is_bot) instead of project alone, ensuring both populations are represented. Compute separate Bayesian estimates for human and bot subsets alongside the overall estimate.

generate_sankey.py: Add Human Only/All/Bot Only toggle with three pre-rendered sankeys. Add Source column (Human/Bot) to detail table with dropdown filter. Defer rendering of hidden sankeys via data-pending attribute. When no bot issues exist, the report looks identical to before.
- Fix classify_issues.py to use _get_is_bot() with proper string
  coercion instead of raw .get(), which would miscount string "false"
  from Snowflake as truthy
- Deduplicate bot_issues CTE in activity-type-report.md to prevent
  label list drift between the two SQL blocks
- Add total == 0 guard in sample_and_estimate.py to prevent
  ZeroDivisionError on empty input
- Add 27 tests for _get_is_bot() and stratified_sample() covering
  type coercion, missing keys, stratum guarantees, and determinism
- Rename misleading 'overall' variable to 'est_section' in
  generate_summary_stats()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread plugins/snowflake/skills/setup-snowflake/SKILL.md
@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 22, 2026
@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented May 22, 2026

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants