Separate bot and human work in activity type reports#430
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: brenton The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds bot detection and per-population stratified sampling/reporting to the Snowflake plugin: normalizes an Changes
Sequence DiagramssequenceDiagram
participant CLI as User/CLI
participant Classify as Classify Service
participant Sample as Sampler
participant Estimator as Estimation Engine
participant Sankey as Sankey/Report Generator
participant Browser as Client HTML/JS
CLI->>Classify: Read raw issues
Classify->>Classify: _get_is_bot() → emit records with is_bot
CLI->>Sample: submit issues + sample request
Sample->>Sample: split into human_pool & bot_pool
Sample->>Sample: recommend_sample_sizes() → human_n, bot_n
Sample->>Sample: stratified_sample(human_n, bot_n)
Sample->>Estimator: human_classified, bot_classified
Estimator->>Estimator: weighted_overall_estimate() → human, bot, overall
Estimator->>CLI: write estimates JSON
CLI->>Sankey: data + estimates
Sankey->>Sankey: generate views for Human / All / Bot
Sankey->>Browser: emit HTML with toggle and deferred charts
Browser->>Browser: user toggles view → render corresponding charts
sequenceDiagram
participant UI as User (Toggle)
participant JS as Report JS
participant DOM as Container
participant D3 as Renderer
UI->>JS: Click Human / All / Bot tab
JS->>JS: switchView(view)
JS->>DOM: Show/hide view sections, set data-pending
JS->>D3: If container visible (clientWidth>0) → render charts
alt visible
D3->>DOM: Render sankey & CI
JS->>JS: Update counts & CSV content scoped to view
else hidden
JS->>JS: Defer rendering until visible
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 9 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (9 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
plugins/snowflake/scripts/sample_and_estimate.py (1)
353-380:⚠️ Potential issue | 🟠 MajorWeight estimates by stratum before reporting sampled distributions.
The sampler intentionally over-represents small
(project, is_bot)strata by guaranteeing at least one item, butestimate_distribution()treats the classified sample as if it were simple random. This can bias overall, human-only, and bot-only percentages toward rare strata.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugins/snowflake/scripts/sample_and_estimate.py` around lines 353 - 380, The reported distributions are biased because small (project, is_bot) strata were oversampled and estimate_distribution() is treating samples as simple random; fix by weighting samples by their stratum sampling probabilities before calling the estimator. For classified/human_classified/bot_classified, compute per-stratum population counts (using human_total/bot_total and project-level totals used in estimate_by_project) and per-stratum sample counts, derive a sample weight for each record = population_count / sample_count for that stratum (or equivalently inverse sampling probability), and pass those weights into the estimator (either extend estimate_distribution to accept a weights argument or call a new weighted estimator). Update the calls in this block (the overall estimate_distribution call, the human_estimates and bot_estimates calls, and any place using estimate_by_project that aggregates) to use the weighted version so final percentages are unbiased for the stratified sampling.plugins/snowflake/scripts/generate_sankey.py (1)
761-783:⚠️ Potential issue | 🟡 MinorUse the displayed Source labels for search and CSV export.
The table displays
is_botasHuman/Bot, but global search sees raw booleans and CSV exportstrueor blank for the Source column. That makes the new Source column inconsistent with the UI.🐛 Proposed helper
function escapeHtml(s) { if (s == null) return ""; return String(s).replace(/&/g,"&").replace(/</g,"<").replace(/>/g,">").replace(/"/g,"""); } + +function displayValue(row, key) { + if (key === "is_bot") return row.is_bot ? "Bot" : "Human"; + return row[key] == null ? "" : row[key]; +}- if (String(row[COLUMNS[i].key] || "").toLowerCase().indexOf(globalTerm) >= 0) { + if (String(displayValue(row, COLUMNS[i].key)).toLowerCase().indexOf(globalTerm) >= 0) {- var v = String(row[COLUMNS[j].key] || "").replace(/"/g, '""'); + var v = String(displayValue(row, COLUMNS[j].key)).replace(/"/g, '""');Also applies to: 1002-1010
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugins/snowflake/scripts/generate_sankey.py` around lines 761 - 783, The global search and CSV export treat the Source column as raw booleans, causing mismatches with the UI; update applyFilters (and the CSV generation code around lines handling export, e.g., the export loop at ~1002-1010) to map row.is_bot to the displayed labels ("Human" / "Bot") before matching or writing: in applyFilters change globalTerm matching to compare against a computed display value for is_bot (e.g., var sourceLabel = row.is_bot ? "Bot" : "Human") and include that label in the per-column and global comparisons, adjust the colFilters branch for col === "is_bot" to compare against the label instead of boolean, and change CSV export to write sourceLabel instead of true/blank so search and exports use the same displayed Source values.
🧹 Nitpick comments (1)
plugins/snowflake/scripts/classify_issues.py (1)
64-69: Extract bot-flag normalization into a shared helper.
_get_is_bot()is duplicated across the classification, sampling, and Sankey scripts. A shared utility would prevent drift if the accepted Snowflake/string representations change later. The provided context shows the same helper inplugins/snowflake/scripts/sample_and_estimate.py:49-56.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugins/snowflake/scripts/classify_issues.py` around lines 64 - 69, The _get_is_bot helper logic is duplicated; extract the normalization into a single shared utility (e.g., a new function normalize_is_bot or get_is_bot) placed in a common module (such as plugins.snowflake.utils or plugins.snowflake.helpers), move the existing implementation from _get_is_bot (used in classify_issues.py) and the identical code in sample_and_estimate.py into that module, then replace the local _get_is_bot implementations with imports calling the shared function (update classify_issues._get_is_bot usage to call the new shared function and remove the duplicate local definitions) so all scripts (classify_issues, sample_and_estimate, and Sankey scripts) use the same normalized helper.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@plugins/snowflake/commands/activity-type-report.md`:
- Line 231: The docs claim a default sampling precision of ±2.5% but the
underlying script sample_and_estimate.py defaults to --target-width 0.10 (≈±5%);
either update the command examples in activity-type-report.md to pass
--target-width 0.05 when invoking sample_and_estimate.py (so the behavior
matches the ±2.5% statement) or change the documented default to ±5% and note
the script default --target-width=0.10; apply the same change for the other
matching occurrences in this file (the other example blocks that mention ±2.5%).
- Line 398: The docs/instructions assume presence of human.estimates[] and
bot.estimates[] when the human/bot keys exist, but estimates.json may include
"human" or "bot" keys set to null; update the text to explicitly check that
human and bot are not null before attempting to read human.estimates[] or
bot.estimates[] (e.g., read overall.estimates[] always, then if human != null
use human.estimates[], and if bot != null use bot.estimates[]), and clarify that
usage is read from $RUN_DIR/classified_sample_usage.txt (or
classified_issues_usage.txt in full mode) only when those separate distributions
exist.
In `@plugins/snowflake/scripts/classify_issues.py`:
- Line 319: There are two f-strings used without placeholders: locate the print
calls containing the exact literals "Bot/Human Split:" and "API Usage:" (the
print(f"\nBot/Human Split:") and print(f"\nAPI Usage:") calls) and remove the
unnecessary f prefix so they become plain string prints (e.g.,
print("\nBot/Human Split:") and print("\nAPI Usage:")) to satisfy Ruff F541.
In `@plugins/snowflake/scripts/generate_sankey.py`:
- Around line 156-160: The code currently falls back to top-level estimates and
mixes sources: compute and use values from the chosen sub-estimate (est_section)
consistently — e.g., set ci_pct = int(est_section.get("confidence", 0.95) * 100)
and read sample_size, total_population, sample_fraction from est_section rather
than estimates; do the same for the stat cards and the loop that iterates over
est_section.get("estimates", []) (and the later block around the 190-203 logic).
Also stop forcing a fallback to the top-level "overall" when a requested
sub-estimate is explicitly None: preserve None (or handle it as an empty
human/bot view) instead of substituting overall so human/bot views don't show
all-work stats.
In `@plugins/snowflake/scripts/sample_and_estimate.py`:
- Around line 82-87: The current per-stratum guaranteed allocation loop
(allocations, remaining, by_stratum) can exceed the requested sample size when n
< number of strata; change the loop so you never allocate more than remaining
slots: for each stratum in by_stratum, set allocations[stratum] = min(1,
len(stratum_issues), remaining) (or skip allocation if remaining == 0) then
decrement remaining and break when remaining == 0; this ensures the
one-per-stratum guarantee is capped by n and preserves correct
remaining/sample_fraction accounting.
---
Outside diff comments:
In `@plugins/snowflake/scripts/generate_sankey.py`:
- Around line 761-783: The global search and CSV export treat the Source column
as raw booleans, causing mismatches with the UI; update applyFilters (and the
CSV generation code around lines handling export, e.g., the export loop at
~1002-1010) to map row.is_bot to the displayed labels ("Human" / "Bot") before
matching or writing: in applyFilters change globalTerm matching to compare
against a computed display value for is_bot (e.g., var sourceLabel = row.is_bot
? "Bot" : "Human") and include that label in the per-column and global
comparisons, adjust the colFilters branch for col === "is_bot" to compare
against the label instead of boolean, and change CSV export to write sourceLabel
instead of true/blank so search and exports use the same displayed Source
values.
In `@plugins/snowflake/scripts/sample_and_estimate.py`:
- Around line 353-380: The reported distributions are biased because small
(project, is_bot) strata were oversampled and estimate_distribution() is
treating samples as simple random; fix by weighting samples by their stratum
sampling probabilities before calling the estimator. For
classified/human_classified/bot_classified, compute per-stratum population
counts (using human_total/bot_total and project-level totals used in
estimate_by_project) and per-stratum sample counts, derive a sample weight for
each record = population_count / sample_count for that stratum (or equivalently
inverse sampling probability), and pass those weights into the estimator (either
extend estimate_distribution to accept a weights argument or call a new weighted
estimator). Update the calls in this block (the overall estimate_distribution
call, the human_estimates and bot_estimates calls, and any place using
estimate_by_project that aggregates) to use the weighted version so final
percentages are unbiased for the stratified sampling.
---
Nitpick comments:
In `@plugins/snowflake/scripts/classify_issues.py`:
- Around line 64-69: The _get_is_bot helper logic is duplicated; extract the
normalization into a single shared utility (e.g., a new function
normalize_is_bot or get_is_bot) placed in a common module (such as
plugins.snowflake.utils or plugins.snowflake.helpers), move the existing
implementation from _get_is_bot (used in classify_issues.py) and the identical
code in sample_and_estimate.py into that module, then replace the local
_get_is_bot implementations with imports calling the shared function (update
classify_issues._get_is_bot usage to call the new shared function and remove the
duplicate local definitions) so all scripts (classify_issues,
sample_and_estimate, and Sankey scripts) use the same normalized helper.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 03e7886a-1573-473d-b232-9022d0ca0336
📒 Files selected for processing (5)
plugins/snowflake/commands/activity-type-report.mdplugins/snowflake/scripts/classify_issues.pyplugins/snowflake/scripts/generate_sankey.pyplugins/snowflake/scripts/sample_and_estimate.pyplugins/snowflake/scripts/test_sample_and_estimate.py
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
plugins/snowflake/scripts/generate_sankey.py (1)
1016-1018:⚠️ Potential issue | 🟡 MinorExport the Source column as
Human/Bot, not raw booleans.With the new
is_botcolumn, CSV currently writestruefor bots and an empty string for humans becausefalse || ""becomes"".📤 Proposed CSV fix
for (var j = 0; j < COLUMNS.length; j++) { - var v = String(row[COLUMNS[j].key] || "").replace(/"/g, '""'); + var key = COLUMNS[j].key; + var raw = key === "is_bot" ? (row.is_bot ? "Bot" : "Human") : row[key]; + var v = String(raw || "").replace(/"/g, '""'); cells.push('"' + v + '"'); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugins/snowflake/scripts/generate_sankey.py` around lines 1016 - 1018, The CSV writer currently uses `var v = String(row[COLUMNS[j].key] || "").replace(/"/g, '""');` which turns false into "" for the new boolean `is_bot` column; change the assignment so that when COLUMNS[j].key (or COLUMNS[j].name if your columns use names) equals 'is_bot' you map the boolean to the strings "Bot" for true and "Human" for false, otherwise use the original String(value) fallback and then apply .replace(/"/g,'""'); update the code inside the for-loop that defines `v` (refer to the loop over `COLUMNS`, the `v` variable and the `is_bot` key) to implement this mapping.
♻️ Duplicate comments (3)
plugins/snowflake/scripts/sample_and_estimate.py (1)
77-82:⚠️ Potential issue | 🟠 MajorCap the one-per-project guarantee by the remaining budget.
When
budgetis smaller than the number of projects, this still assigns one sample to every project, so the returned sample can exceed the requested budget.🐛 Proposed allocation fix
allocations = {} remaining = n for proj, issues in pool_by_project.items(): - allocations[proj] = min(1, len(issues)) + if remaining <= 0: + allocations[proj] = 0 + continue + allocations[proj] = min(1, len(issues), remaining) remaining -= allocations[proj]🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugins/snowflake/scripts/sample_and_estimate.py` around lines 77 - 82, The current loop unconditionally gives each project one sample regardless of remaining budget, causing the total to exceed n; modify the allocation in the for-loop over pool_by_project so that allocations[proj] is min(1, len(issues), remaining) (or allocate 1 only if remaining > 0), and only decrement remaining when you actually allocate >0; reference variables/functions: allocations, remaining, pool_by_project, n.plugins/snowflake/commands/activity-type-report.md (1)
399-399:⚠️ Potential issue | 🟡 MinorCheck
human/botfor non-null before reading.estimates[].
estimates.jsoncan include"human": nullor"bot": null, so presence of the key alone is not enough before usinghuman.estimates[]orbot.estimates[].📝 Proposed wording
-Read the estimates from `$RUN_DIR/estimates.json`. For the overall distribution, use `overall.estimates[]` (each with `category`, `posterior_mean`, `ci_low`, `ci_high`). When `human` and `bot` keys are present in the JSON, use `human.estimates[]` and `bot.estimates[]` for the separate distributions. Read usage from `$RUN_DIR/classified_sample_usage.txt` (or `classified_issues_usage.txt` in full mode). +Read the estimates from `$RUN_DIR/estimates.json`. For the overall distribution, use `overall.estimates[]` (each with `category`, `posterior_mean`, `ci_low`, `ci_high`). When `human` or `bot` is present and non-null in the JSON, use its `.estimates[]` for the separate distribution. Read usage from `$RUN_DIR/classified_sample_usage.txt` (or `classified_issues_usage.txt` in full mode).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugins/snowflake/commands/activity-type-report.md` at line 399, When parsing estimates.json (reading overall.estimates, human.estimates, bot.estimates) ensure you not only check for the existence of the "human" and "bot" keys but also verify they are non-null before accessing their .estimates arrays; update the parsing logic that reads $RUN_DIR/estimates.json to guard accesses like human.estimates and bot.estimates with a null-check (e.g., if (human && human.estimates) / if (bot && bot.estimates)) and fall back to skipping or treating as empty distributions when null, while still using overall.estimates for the overall distribution and reading usage from classified_sample_usage.txt or classified_issues_usage.txt as before.plugins/snowflake/scripts/generate_sankey.py (1)
156-159:⚠️ Potential issue | 🟠 MajorDo not fall back to
overallfor an explicitly missing Human/Bot estimate.If
estimates["human"]orestimates["bot"]isNone, the selected view should show no sampled estimate rather than all-work estimates and top-level sample stats.🐛 Proposed fix
# Sampling mode: show posterior estimates with credible intervals est_section = estimates.get(estimates_key, estimates.get("overall", estimates)) if est_section is None: - est_section = estimates.get("overall", estimates) + return '<p style="color:var(--text-muted);">No sampled estimate available for this view.</p>' ci_pct = int(estimates.get("confidence", 0.95) * 100)Also applies to: 192-201
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugins/snowflake/scripts/generate_sankey.py` around lines 156 - 159, The code currently replaces an explicit None estimate for a specific view (est_section) with the top-level "overall" estimates; instead, detect whether estimates_key exists in the estimates mapping and only fall back to estimates["overall"] when the key is absent (not when the key exists but its value is None). Update the selection logic around est_section/estimates_key in generate_sankey.py (the block that sets est_section and the similar block at lines ~192-201) to use a membership check (e.g., if estimates_key in estimates: est_section = estimates[estimates_key] else: est_section = estimates.get("overall")) so that explicit None values remain None and result in no sampled estimate being shown.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@plugins/snowflake/commands/activity-type-report.md`:
- Line 311: The sample-output code fences for the Activity Type Report examples
use plain ``` fences; update each of those fences to specify the language "text"
so markdownlint MD040 is satisfied — specifically change the three-backtick
openings before the lines containing "Activity Type Report:
$RUN_DIR/activity-type-report.html" and the other two sample-output blocks (the
ones near the snippets that show sample output) from ``` to ```text; ensure all
matching closing fences remain ``` so the blocks render as plain text.
In `@plugins/snowflake/scripts/sample_and_estimate.py`:
- Around line 254-259: The current early-return logic treats an unsampled (but
non-empty) group as empty and causes weighted_overall_estimate() to ignore that
group's prior—fix by removing the simple truthy short-circuit and only skip
calling estimate_distribution when the group's true population size is zero;
instead always call estimate_distribution for both human_classified and
bot_classified unless their population/count field (e.g., total, size or N) is
explicitly zero, and then pass both estimates into weighted_overall_estimate();
update the conditions around human_classified and bot_classified so they check
the population-count attribute rather than truthiness before returning.
In `@plugins/snowflake/scripts/test_sample_and_estimate.py`:
- Line 205: The test calls stratified_sample returning "sample, meta" but "meta"
is unused; rename the unused unpacked variable to a prefixed name (e.g., _meta
or _) where stratified_sample(...) is assigned (the line with sample, meta =
stratified_sample(...)) so the linter (Ruff) treats it as intentionally unused;
update the variable name in that assignment only.
---
Outside diff comments:
In `@plugins/snowflake/scripts/generate_sankey.py`:
- Around line 1016-1018: The CSV writer currently uses `var v =
String(row[COLUMNS[j].key] || "").replace(/"/g, '""');` which turns false into
"" for the new boolean `is_bot` column; change the assignment so that when
COLUMNS[j].key (or COLUMNS[j].name if your columns use names) equals 'is_bot'
you map the boolean to the strings "Bot" for true and "Human" for false,
otherwise use the original String(value) fallback and then apply
.replace(/"/g,'""'); update the code inside the for-loop that defines `v` (refer
to the loop over `COLUMNS`, the `v` variable and the `is_bot` key) to implement
this mapping.
---
Duplicate comments:
In `@plugins/snowflake/commands/activity-type-report.md`:
- Line 399: When parsing estimates.json (reading overall.estimates,
human.estimates, bot.estimates) ensure you not only check for the existence of
the "human" and "bot" keys but also verify they are non-null before accessing
their .estimates arrays; update the parsing logic that reads
$RUN_DIR/estimates.json to guard accesses like human.estimates and bot.estimates
with a null-check (e.g., if (human && human.estimates) / if (bot &&
bot.estimates)) and fall back to skipping or treating as empty distributions
when null, while still using overall.estimates for the overall distribution and
reading usage from classified_sample_usage.txt or classified_issues_usage.txt as
before.
In `@plugins/snowflake/scripts/generate_sankey.py`:
- Around line 156-159: The code currently replaces an explicit None estimate for
a specific view (est_section) with the top-level "overall" estimates; instead,
detect whether estimates_key exists in the estimates mapping and only fall back
to estimates["overall"] when the key is absent (not when the key exists but its
value is None). Update the selection logic around est_section/estimates_key in
generate_sankey.py (the block that sets est_section and the similar block at
lines ~192-201) to use a membership check (e.g., if estimates_key in estimates:
est_section = estimates[estimates_key] else: est_section =
estimates.get("overall")) so that explicit None values remain None and result in
no sampled estimate being shown.
In `@plugins/snowflake/scripts/sample_and_estimate.py`:
- Around line 77-82: The current loop unconditionally gives each project one
sample regardless of remaining budget, causing the total to exceed n; modify the
allocation in the for-loop over pool_by_project so that allocations[proj] is
min(1, len(issues), remaining) (or allocate 1 only if remaining > 0), and only
decrement remaining when you actually allocate >0; reference
variables/functions: allocations, remaining, pool_by_project, n.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 264318d5-7f1a-47fa-97f4-b98e53f22543
📒 Files selected for processing (4)
plugins/snowflake/commands/activity-type-report.mdplugins/snowflake/scripts/generate_sankey.pyplugins/snowflake/scripts/sample_and_estimate.pyplugins/snowflake/scripts/test_sample_and_estimate.py
d61caa0 to
8dd9537
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
plugins/snowflake/scripts/generate_sankey.py (1)
154-205:⚠️ Potential issue | 🟠 MajorNull
human/botsub-section can crashgenerate_summary_stats.
sample_and_estimate.pywrites"human": null(or"bot": null) when the classified sample contains only the other population (seemain()elif branches writingNone).generate_htmlcallsgenerate_summary_stats(..., estimates_key="human")wheneverhas_botsis true — which is triggered bybot_dataalone, so a bots-only classified sample still takes that branch withhuman_data=[]. Then:
- Line 156
estimates_key in estimatesis True (the key is present).- Line 157 sets
est_section = None.- Line 160 calls
None.get(...)→AttributeError.The corresponding
generate_ci_charthasif not sub or "estimates" not in sub: return "", but this function has no equivalent guard. Theand est_sectionguard at line 193 only protects the stat-card branch, not the rows branch above.🐛 Proposed fix
- if estimates_key in estimates: - est_section = estimates[estimates_key] - else: + est_section = estimates.get(estimates_key) + if est_section is None: est_section = estimates.get("overall", estimates) + if not est_section or "estimates" not in est_section: + return '<p style="color:var(--text-muted);">No sampled estimate available for this view.</p>' ci_pct = int(est_section.get("confidence", estimates.get("confidence", 0.95)) * 100)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugins/snowflake/scripts/generate_sankey.py` around lines 154 - 205, generate_summary_stats is crashing when estimates contains a key like "human" or "bot" set to None; fix by treating a null sub-section as absent before using it: change the logic that sets est_section (and the subsequent use of ci_pct and the for-loop over est_section.get("estimates", [])) so it only uses estimates[estimates_key] when that value is truthy and a dict (e.g., if estimates_key in estimates and estimates[estimates_key] and "estimates" in estimates[estimates_key] then est_section = estimates[estimates_key] else fall back to estimates.get("overall", estimates)); also guard the ci_pct calculation and iteration to use defaults when est_section is None (matching the guard pattern in generate_ci_chart).
♻️ Duplicate comments (1)
plugins/snowflake/commands/activity-type-report.md (1)
399-399:⚠️ Potential issue | 🟡 MinorStill assumes
human/botkeys are always non-null when present.
sample_and_estimate.pyemits"human": null(or"bot": null) when the classified sample contains only one population. As written, this instruction would direct the agent to readhuman.estimates[]offnull. Suggest clarifying "present and non-null".📝 Proposed wording
-Read the estimates from `$RUN_DIR/estimates.json`. For the overall distribution, use `overall.estimates[]` (each with `category`, `posterior_mean`, `ci_low`, `ci_high`). When `human` and `bot` keys are present in the JSON, use `human.estimates[]` and `bot.estimates[]` for the separate distributions. Read usage from `$RUN_DIR/classified_sample_usage.txt` (or `classified_issues_usage.txt` in full mode). +Read the estimates from `$RUN_DIR/estimates.json`. For the overall distribution, use `overall.estimates[]` (each with `category`, `posterior_mean`, `ci_low`, `ci_high`). When `human` and `bot` are present **and non-null**, use `human.estimates[]` and `bot.estimates[]` for the separate distributions. Read usage from `$RUN_DIR/classified_sample_usage.txt` (or `classified_issues_usage.txt` in full mode).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugins/snowflake/commands/activity-type-report.md` at line 399, The docs/instruction assume "human" and "bot" keys are always non-null; ensure the agent checks that the JSON keys "human" and "bot" both exist and are non-null before attempting to read human.estimates[] or bot.estimates[] (otherwise fall back to overall.estimates[]), and reference sample_and_estimate.py behavior where it may emit "human": null or "bot": null; also keep reading usage from classified_sample_usage.txt (or classified_issues_usage.txt in full mode) as before.
🧹 Nitpick comments (4)
plugins/snowflake/scripts/sample_and_estimate.py (4)
53-58: Consider extracting_get_is_botto a shared module.This helper is now copy-pasted across
classify_issues.py,generate_sankey.py, and this file. If the set of accepted truthy strings ever changes (e.g.,"y","t"), the definitions will drift. A small shared utility module (bot_detection.pyor similar) would keep them in lockstep.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugins/snowflake/scripts/sample_and_estimate.py` around lines 53 - 58, Extract the duplicated _get_is_bot logic into a new shared utility module (e.g., bot_detection.py) and replace the local copies in _get_is_bot (in this file) and the same helper in classify_issues.py and generate_sankey.py with imports from that module; move the current logic into a single function (e.g., is_bot(value) or normalize_is_bot(value)) that accepts the raw value, handles string truthy forms ("true","1","yes" and any future additions), and returns a boolean, then update callers to call the shared function instead of their local _get_is_bot.
419-422: Reuse already-computed population counts.
human_popandbot_popwere computed at lines 330-331 viarecommend_sample_sizes()and are in scope here; recomputinghuman_total/bot_totalduplicates work.♻️ Proposed diff
- human_classified = [i for i in classified if not _get_is_bot(i)] - bot_classified = [i for i in classified if _get_is_bot(i)] - human_total = sum(1 for i in all_issues if not _get_is_bot(i)) - bot_total = total - human_total + human_classified = [i for i in classified if not _get_is_bot(i)] + bot_classified = [i for i in classified if _get_is_bot(i)] + human_total = human_pop + bot_total = bot_pop🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugins/snowflake/scripts/sample_and_estimate.py` around lines 419 - 422, The code recomputes population counts (human_total and bot_total) even though recommend_sample_sizes() already computed human_pop and bot_pop in scope; replace the recomputation by reusing those variables: keep the filtered lists human_classified and bot_classified but set human_total = human_pop and bot_total = bot_pop (or remove human_total/bot_total entirely where appropriate), updating any downstream uses to reference the existing human_pop/bot_pop to avoid duplicate work and potential inconsistency.
61-96: Function namestratified_sampleno longer matches behavior.The new implementation does two-stratum (human/bot) simple random sampling — the project-level stratification has been removed (intentionally, per commit messages). The module docstring at line 7 still says "stratified samples" and the command docs (
activity-type-report.mdline 423) say "Within each population, stratifies by project". Consider renaming to e.g.sample_human_bot()or updating the docstrings/docs so the vocabulary matches the implementation, to avoid confusing future readers who expect per-project stratification.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugins/snowflake/scripts/sample_and_estimate.py` around lines 61 - 96, The function stratified_sample currently performs two-stratum simple random sampling (human vs bot) but its name and surrounding docs still claim "stratified" / per-project stratification; either rename the function to a clear name like sample_human_bot (and update all call sites) or update the module/docstrings and external docs (e.g., activity-type-report.md) to state that sampling is by human/bot only with no per-project stratification; update the function docstring, any references to stratified behavior, and tests/examples to use the new name or corrected description so implementation and docs match (refer to stratified_sample, _get_is_bot, and activity-type-report.md).
195-204: Minor: passstrict=Falsetozip()for clarity.Both
zipcalls iterate lists of known-equal length, but being explicit documents intent and silences ruff B905. The outerzip(human_samples, bot_samples)is particularly worth tagging — a future change todirichlet_samplethat returns a truncated result would silently produce combined estimates from a short MC run rather than erroring.♻️ Proposed diff
- for h_draw, b_draw in zip(human_samples, bot_samples): - weighted = [w_human * h + w_bot * b for h, b in zip(h_draw, b_draw)] + for h_draw, b_draw in zip(human_samples, bot_samples, strict=False): + weighted = [w_human * h + w_bot * b for h, b in zip(h_draw, b_draw, strict=False)]🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugins/snowflake/scripts/sample_and_estimate.py` around lines 195 - 204, The zip usage in the sampling and weighting loop can silently truncate if lengths diverge; update the zip calls to explicitly pass strict=False so mismatched lengths raise immediately: change the outer zip(human_samples, bot_samples) used when building weighted_samples and the inner zip(h_draw, b_draw) inside the list comprehension to zip(..., strict=False), referencing rng, human_samples, bot_samples, dirichlet_sample, and weighted_samples to locate the sites to modify.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@plugins/snowflake/commands/activity-type-report.md`:
- Line 423: The docs mistakenly claim per-project stratification while the
implementation (stratified_sample()) only stratifies by is_bot and performs
random sampling within human/bot pools; update the two occurrences that mention
project-level stratification (the sentence "Within each population, stratifies
by project to ensure all projects are represented." and "Stratifies by (project,
is_bot) to ensure both human and bot populations are represented.") to reflect
the actual behavior (e.g., "Stratifies by is_bot and performs random sampling
within each human/bot population" or similar), or alternatively reintroduce
project-level stratification in the stratified_sample() implementation if you
intend to keep the current wording. Ensure the markdown text and any related
examples describe only the implemented stratification logic and mention that
sampling within each pool is simple random sampling.
In `@plugins/snowflake/scripts/generate_sankey.py`:
- Around line 709-711: The currentViewFilter initialization assumes humans exist
when HAS_BOTS is true and can open the report on an empty population; change the
logic that sets currentViewFilter (currently using HAS_BOTS and the variable
currentViewFilter) to pick whichever side has data by checking counts (e.g.,
inspect human_data.length and bot_data.length or compute counts from TABLE_DATA)
and default to the non-empty view (or accept an explicit initial view passed
from Python); update references to TABLE_DATA and currentViewFilter so the UI
starts on the populated "human" or "bot" view instead of "human" when the human
set is empty.
---
Outside diff comments:
In `@plugins/snowflake/scripts/generate_sankey.py`:
- Around line 154-205: generate_summary_stats is crashing when estimates
contains a key like "human" or "bot" set to None; fix by treating a null
sub-section as absent before using it: change the logic that sets est_section
(and the subsequent use of ci_pct and the for-loop over
est_section.get("estimates", [])) so it only uses estimates[estimates_key] when
that value is truthy and a dict (e.g., if estimates_key in estimates and
estimates[estimates_key] and "estimates" in estimates[estimates_key] then
est_section = estimates[estimates_key] else fall back to
estimates.get("overall", estimates)); also guard the ci_pct calculation and
iteration to use defaults when est_section is None (matching the guard pattern
in generate_ci_chart).
---
Duplicate comments:
In `@plugins/snowflake/commands/activity-type-report.md`:
- Line 399: The docs/instruction assume "human" and "bot" keys are always
non-null; ensure the agent checks that the JSON keys "human" and "bot" both
exist and are non-null before attempting to read human.estimates[] or
bot.estimates[] (otherwise fall back to overall.estimates[]), and reference
sample_and_estimate.py behavior where it may emit "human": null or "bot": null;
also keep reading usage from classified_sample_usage.txt (or
classified_issues_usage.txt in full mode) as before.
---
Nitpick comments:
In `@plugins/snowflake/scripts/sample_and_estimate.py`:
- Around line 53-58: Extract the duplicated _get_is_bot logic into a new shared
utility module (e.g., bot_detection.py) and replace the local copies in
_get_is_bot (in this file) and the same helper in classify_issues.py and
generate_sankey.py with imports from that module; move the current logic into a
single function (e.g., is_bot(value) or normalize_is_bot(value)) that accepts
the raw value, handles string truthy forms ("true","1","yes" and any future
additions), and returns a boolean, then update callers to call the shared
function instead of their local _get_is_bot.
- Around line 419-422: The code recomputes population counts (human_total and
bot_total) even though recommend_sample_sizes() already computed human_pop and
bot_pop in scope; replace the recomputation by reusing those variables: keep the
filtered lists human_classified and bot_classified but set human_total =
human_pop and bot_total = bot_pop (or remove human_total/bot_total entirely
where appropriate), updating any downstream uses to reference the existing
human_pop/bot_pop to avoid duplicate work and potential inconsistency.
- Around line 61-96: The function stratified_sample currently performs
two-stratum simple random sampling (human vs bot) but its name and surrounding
docs still claim "stratified" / per-project stratification; either rename the
function to a clear name like sample_human_bot (and update all call sites) or
update the module/docstrings and external docs (e.g., activity-type-report.md)
to state that sampling is by human/bot only with no per-project stratification;
update the function docstring, any references to stratified behavior, and
tests/examples to use the new name or corrected description so implementation
and docs match (refer to stratified_sample, _get_is_bot, and
activity-type-report.md).
- Around line 195-204: The zip usage in the sampling and weighting loop can
silently truncate if lengths diverge; update the zip calls to explicitly pass
strict=False so mismatched lengths raise immediately: change the outer
zip(human_samples, bot_samples) used when building weighted_samples and the
inner zip(h_draw, b_draw) inside the list comprehension to zip(...,
strict=False), referencing rng, human_samples, bot_samples, dirichlet_sample,
and weighted_samples to locate the sites to modify.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 8d14053a-99d6-4d68-ac28-c909826ab888
📒 Files selected for processing (8)
.claude-plugin/marketplace.jsondocs/data.jsonplugins/snowflake/.claude-plugin/plugin.jsonplugins/snowflake/commands/activity-type-report.mdplugins/snowflake/scripts/classify_issues.pyplugins/snowflake/scripts/generate_sankey.pyplugins/snowflake/scripts/sample_and_estimate.pyplugins/snowflake/scripts/test_sample_and_estimate.py
✅ Files skipped from review due to trivial changes (4)
- plugins/snowflake/.claude-plugin/plugin.json
- docs/data.json
- .claude-plugin/marketplace.json
- plugins/snowflake/scripts/test_sample_and_estimate.py
🚧 Files skipped from review as they are similar to previous changes (1)
- plugins/snowflake/scripts/classify_issues.py
| var filteredData = TABLE_DATA.slice(); | ||
| var colFilters = {}; | ||
| var currentViewFilter = HAS_BOTS ? "human" : "all"; |
There was a problem hiding this comment.
Default view can land on an empty population.
currentViewFilter = HAS_BOTS ? "human" : "all" assumes humans exist whenever bots do. For a bots-only dataset (has_bots=True, human_data=[]), the report opens on "Human Only (0)" with an empty sankey/summary. Consider defaulting to the populated view when one side is empty (e.g., choose based on human_data.length / bot_data.length, or pass the initial view from Python).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@plugins/snowflake/scripts/generate_sankey.py` around lines 709 - 711, The
currentViewFilter initialization assumes humans exist when HAS_BOTS is true and
can open the report on an empty population; change the logic that sets
currentViewFilter (currently using HAS_BOTS and the variable currentViewFilter)
to pick whichever side has data by checking counts (e.g., inspect
human_data.length and bot_data.length or compute counts from TABLE_DATA) and
default to the non-empty view (or accept an explicit initial view passed from
Python); update references to TABLE_DATA and currentViewFilter so the UI starts
on the populated "human" or "bot" view instead of "human" when the human set is
empty.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (3)
plugins/snowflake/scripts/generate_sankey.py (1)
44-49: Optional: extract the duplicated_get_is_bothelper into a shared module.The exact same
_get_is_botis now defined in bothsample_and_estimate.pyandgenerate_sankey.py. If a new bot-flag representation appears (e.g. integer1, JSONnull), only one copy will be updated. Consider promoting it to a tiny shared module underplugins/snowflake/scripts/and importing it from both places.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugins/snowflake/scripts/generate_sankey.py` around lines 44 - 49, Extract the duplicated helper _get_is_bot into a small shared module (e.g., plugins/snowflake/scripts/bot_utils.py), move the current implementation there, and update both generate_sankey.py and sample_and_estimate.py to import _get_is_bot from that module; ensure the moved function preserves existing behavior (handling strings like "true"/"1"/"yes" and non-string truthiness) and run tests to confirm nothing else breaks.plugins/snowflake/scripts/sample_and_estimate.py (2)
379-383: Minor: emptyelse ""branch prints a blank line.When
human_pop == 0(all-bot dataset), the conditional expression evaluates to"", butprint("")still emits a blank line. Cleaner to wrap the wholeif:🧹 Proposed change
- print(f" Human: {h_sampled} of {human_pop} " - f"({h_sampled/human_pop*100:.1f}%)" if human_pop > 0 else "") - if bot_pop > 0: + if human_pop > 0: + print(f" Human: {h_sampled} of {human_pop} " + f"({h_sampled/human_pop*100:.1f}%)") + if bot_pop > 0: print(f" Bot: {b_sampled} of {bot_pop} " f"({b_sampled/bot_pop*100:.1f}%)")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugins/snowflake/scripts/sample_and_estimate.py` around lines 379 - 383, The print for human sample uses a conditional expression that yields an empty string and still calls print, producing a blank line when human_pop == 0; change this by wrapping the human print in an explicit if check: replace the conditional-expression print that references h_sampled and human_pop with an if human_pop > 0: print(...) block so no blank line is emitted (leave the existing bot print using bot_pop and b_sampled unchanged).
434-466: Optional: factor out the per-population estimate construction to reduce duplication.The three branches (
both,only human,only bot) repeat the same dict-construction shape three times. A small helper would eliminate the duplication and make future schema changes easier:♻️ Suggested refactor
- human_estimates = None - bot_estimates = None - if human_classified and bot_classified: - human_estimates = { - "population": human_total, - "sample_size": len(human_classified), - **estimate_distribution( - human_classified, confidence=args.confidence, seed=args.seed - ), - } - bot_estimates = { - "population": bot_total, - "sample_size": len(bot_classified), - **estimate_distribution( - bot_classified, confidence=args.confidence, seed=args.seed - ), - } - elif human_classified: - human_estimates = { - "population": human_total, - "sample_size": len(human_classified), - **estimate_distribution( - human_classified, confidence=args.confidence, seed=args.seed - ), - } - elif bot_classified: - bot_estimates = { - "population": bot_total, - "sample_size": len(bot_classified), - **estimate_distribution( - bot_classified, confidence=args.confidence, seed=args.seed - ), - } + def _pop_estimate(classified, population): + if not classified: + return None + return { + "population": population, + "sample_size": len(classified), + **estimate_distribution( + classified, confidence=args.confidence, seed=args.seed + ), + } + + human_estimates = _pop_estimate(human_classified, human_total) + bot_estimates = _pop_estimate(bot_classified, bot_total)Note that this also subtly changes behavior: with the current code, in the "only human" / "only bot" branches the missing side stays
None, but here both are computed independently — same end result, fewer branches.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugins/snowflake/scripts/sample_and_estimate.py` around lines 434 - 466, The code repeats identical dict construction for human_estimates and bot_estimates across three branches; create a small helper (e.g., build_estimate(population_total, classified_list)) that calls estimate_distribution(classified_list, confidence=args.confidence, seed=args.seed) and returns {"population": population_total, "sample_size": len(classified_list), **distribution}, then replace the duplicated blocks that construct human_estimates and bot_estimates using that helper while preserving current semantics (leave the other side as None when human_classified or bot_classified is absent); reference variables/functions: human_estimates, bot_estimates, human_classified, bot_classified, human_total, bot_total, estimate_distribution, args.confidence, args.seed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@plugins/snowflake/scripts/generate_sankey.py`:
- Around line 270-274: The current conditional mixes null-check and width-check
causing a null-deref when document.getElementById('{container_id}') returns
null; change the logic to guard the null case first: call
document.getElementById('{container_id}') into el, if (!el) return (or handle
absence), then separately check if (el.clientWidth === 0) and in that branch
call el.setAttribute('data-pending','true') and return; update the code around
the getElementById('{container_id}') / el usage so getElementById, el, and
setAttribute are referenced in those distinct guarded branches.
---
Nitpick comments:
In `@plugins/snowflake/scripts/generate_sankey.py`:
- Around line 44-49: Extract the duplicated helper _get_is_bot into a small
shared module (e.g., plugins/snowflake/scripts/bot_utils.py), move the current
implementation there, and update both generate_sankey.py and
sample_and_estimate.py to import _get_is_bot from that module; ensure the moved
function preserves existing behavior (handling strings like "true"/"1"/"yes" and
non-string truthiness) and run tests to confirm nothing else breaks.
In `@plugins/snowflake/scripts/sample_and_estimate.py`:
- Around line 379-383: The print for human sample uses a conditional expression
that yields an empty string and still calls print, producing a blank line when
human_pop == 0; change this by wrapping the human print in an explicit if check:
replace the conditional-expression print that references h_sampled and human_pop
with an if human_pop > 0: print(...) block so no blank line is emitted (leave
the existing bot print using bot_pop and b_sampled unchanged).
- Around line 434-466: The code repeats identical dict construction for
human_estimates and bot_estimates across three branches; create a small helper
(e.g., build_estimate(population_total, classified_list)) that calls
estimate_distribution(classified_list, confidence=args.confidence,
seed=args.seed) and returns {"population": population_total, "sample_size":
len(classified_list), **distribution}, then replace the duplicated blocks that
construct human_estimates and bot_estimates using that helper while preserving
current semantics (leave the other side as None when human_classified or
bot_classified is absent); reference variables/functions: human_estimates,
bot_estimates, human_classified, bot_classified, human_total, bot_total,
estimate_distribution, args.confidence, args.seed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 14d7deff-c487-4ee4-a8f5-7eb4dd2f12b7
📒 Files selected for processing (3)
plugins/snowflake/commands/activity-type-report.mdplugins/snowflake/scripts/generate_sankey.pyplugins/snowflake/scripts/sample_and_estimate.py
| var el = document.getElementById('{container_id}'); | ||
| if (!el || el.clientWidth === 0) {{ | ||
| el.setAttribute('data-pending', 'true'); | ||
| return; | ||
| }} |
There was a problem hiding this comment.
Null-dereference when the CI chart container is missing.
If document.getElementById('{container_id}') returns null, the !el || el.clientWidth === 0 short-circuits to truthy and then immediately calls el.setAttribute(...), which throws TypeError: Cannot read properties of null. Guard the two cases separately:
🐛 Proposed fix
var el = document.getElementById('{container_id}');
- if (!el || el.clientWidth === 0) {{
- el.setAttribute('data-pending', 'true');
- return;
- }}
+ if (!el) return;
+ if (el.clientWidth === 0) {{
+ el.setAttribute('data-pending', 'true');
+ return;
+ }}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| var el = document.getElementById('{container_id}'); | |
| if (!el || el.clientWidth === 0) {{ | |
| el.setAttribute('data-pending', 'true'); | |
| return; | |
| }} | |
| var el = document.getElementById('{container_id}'); | |
| if (!el) return; | |
| if (el.clientWidth === 0) {{ | |
| el.setAttribute('data-pending', 'true'); | |
| return; | |
| }} |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@plugins/snowflake/scripts/generate_sankey.py` around lines 270 - 274, The
current conditional mixes null-check and width-check causing a null-deref when
document.getElementById('{container_id}') returns null; change the logic to
guard the null case first: call document.getElementById('{container_id}') into
el, if (!el) return (or handle absence), then separately check if
(el.clientWidth === 0) and in that branch call
el.setAttribute('data-pending','true') and return; update the code around the
getElementById('{container_id}') / el usage so getElementById, el, and
setAttribute are referenced in those distinct guarded branches.
|
/retest-required |
|
/test images |
Bot-created Jira issues (e.g., ART image-build-failures) dominate activity type distributions, masking actual human engineering work. This adds bot detection via JIRA_LABEL_RHAI labels and separates the analysis. Changes across 4 files in the snowflake plugin: activity-type-report.md: Added bot_issues CTE to Phase 3 SQL using 13 verified bot labels (auto-created, art:image-build-failure, ai-generated-jira, etc.) identified across 48 HP projects. Updated Phase 6 text summary to show separate human/bot distributions when bots are detected. classify_issues.py: Pass through IS_BOT field from Snowflake data to classified output. Print bot/human split in console summary. sample_and_estimate.py: Stratify samples by (project, is_bot) instead of project alone, ensuring both populations are represented. Compute separate Bayesian estimates for human and bot subsets alongside the overall estimate. generate_sankey.py: Add Human Only/All/Bot Only toggle with three pre-rendered sankeys. Add Source column (Human/Bot) to detail table with dropdown filter. Defer rendering of hidden sankeys via data-pending attribute. When no bot issues exist, the report looks identical to before.
- Fix classify_issues.py to use _get_is_bot() with proper string coercion instead of raw .get(), which would miscount string "false" from Snowflake as truthy - Deduplicate bot_issues CTE in activity-type-report.md to prevent label list drift between the two SQL blocks - Add total == 0 guard in sample_and_estimate.py to prevent ZeroDivisionError on empty input - Add 27 tests for _get_is_bot() and stratified_sample() covering type coercion, missing keys, stratum guarantees, and determinism - Rename misleading 'overall' variable to 'est_section' in generate_summary_stats() Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Bot-created Jira issues (e.g., ART image-build-failures) dominate activity type distributions, masking actual human engineering work. This adds bot detection via JIRA_LABEL_RHAI labels and separates the analysis throughout the pipeline.
SQL & skill definition (activity-type-report.md):
Add bot_issues CTE with 13 verified bot labels to Phase 3 SQL Deduplicate CTE in components query to prevent label list drift Update Phase 6 text summary to show separate human/bot distributions Update --sample description to document (project, is_bot) stratification classify_issues.py:
Add _get_is_bot() with proper string coercion (Snowflake returns string "false" which is truthy without coercion) Pass through is_bot field using _get_is_bot() instead of raw .get() Print bot/human split in console output
sample_and_estimate.py:
Stratify samples by (project, is_bot) instead of project alone Compute separate Bayesian estimates for human and bot subsets Add total==0 guard to prevent ZeroDivisionError on empty input generate_sankey.py:
Add Human Only / All / Bot Only toggle with three pre-rendered sankeys Add Source column (Human/Bot) to detail table with dropdown filter Generate per-view CI charts for sampling mode
Fix CI chart rendering: move D3 script loads before CI chart scripts Defer rendering of hidden sankeys/CI charts via data-pending attribute Rename misleading 'overall' variable to 'est_section' in generate_summary_stats() test_sample_and_estimate.py:
Add 27 tests covering _get_is_bot() type coercion, missing keys, and stratified_sample() stratum guarantees, proportionality, and determinism
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes
Special notes for your reviewer:
Checklist:
Summary by CodeRabbit
New Features
Bug Fixes / Improvements
Documentation
Tests