Skip to content

perf(RHINENG-25330): rewrite get_host_counts_batch to avoid full hosts table scan#3847

Open
rodrigonull wants to merge 1 commit intomasterfrom
RHINENG-25330
Open

perf(RHINENG-25330): rewrite get_host_counts_batch to avoid full hosts table scan#3847
rodrigonull wants to merge 1 commit intomasterfrom
RHINENG-25330

Conversation

@rodrigonull
Copy link
Copy Markdown
Member

@rodrigonull rodrigonull commented Mar 27, 2026

Overview

This PR is being created to address RHINENG-25330.

Replace the single expensive JOIN query with a "total minus excluded" strategy: count all members from hosts_groups (index-only scan, no join) then subtract the small number of culled/NULL deletion_timestamp hosts. For large orgs, this eliminates the sequential scan on the wide hosts table.

PR Checklist

  • Keep PR title short, ideally under 72 characters
  • Descriptive comments provided in complex code blocks
  • Include raw query examples in the PR description, if adding/modifying SQL query
  • Tests: validate optimal/expected output
  • Tests: validate exceptions and failure scenarios
  • Tests: edge cases
  • Recovers or fails gracefully during potential resource outages (e.g. DB, Kafka)
  • Uses type hinting, if convenient
  • Documentation, if this PR changes the way other services interact with host inventory
  • Links to related PRs

Secure Coding Practices Documentation Reference

You can find documentation on this checklist here.

Secure Coding Checklist

  • Input Validation
  • Output Encoding
  • Authentication and Password Management
  • Session Management
  • Access Control
  • Cryptographic Practices
  • Error Handling and Logging
  • Data Protection
  • Communication Security
  • System Configuration
  • Database Security
  • File Management
  • Memory Management
  • General Coding Practices

Summary by Sourcery

Optimize batched host count retrieval for groups by replacing a join-based query with a two-phase "total minus excluded" counting strategy that avoids full scans of the hosts table for large organizations.

Enhancements:

  • Rework get_host_counts_batch to use separate total and excluded member counts, minimizing joins to the wide hosts table and improving performance on large groups.
  • Update get_host_counts_batch documentation to describe the new "total minus culled" counting approach.

Summary by Sourcery

Optimize batched host count retrieval for groups by replacing a single join-based query with a two-phase counting strategy that avoids full scans of the hosts table for large organizations.

Enhancements:

  • Rework get_host_counts_batch to compute host counts using separate total and excluded member queries, minimizing joins to the wide hosts table and improving performance on large groups.
  • Update get_host_counts_batch documentation to describe the new "total minus culled" counting approach.

@github-actions
Copy link
Copy Markdown
Contributor

SC Environment Impact Assessment

Overall Impact:NONE

No SC Environment-specific impacts detected in this PR.

What was checked

This PR was automatically scanned for:

  • Database migrations
  • ClowdApp configuration changes
  • Kessel integration changes
  • AWS service integrations (S3, RDS, ElastiCache)
  • Kafka topic changes
  • Secrets management changes
  • External dependencies

@rodrigonull rodrigonull marked this pull request as ready for review March 27, 2026 13:21
Copy link
Copy Markdown
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 1 issue, and left some high level feedback:

  • The new "total minus excluded" logic partially reimplements the non-culled filter; consider reusing find_non_culled_hosts or extracting a shared helper so the business rules for culled/non-culled hosts cannot drift over time.
  • When computing count_map[gid_str] = total_map.get(gid_str, 0) - excluded_map.get(gid_str, 0), consider clamping at zero (e.g. max(..., 0)) or asserting non-negativity to guard against any edge cases where the two queries become inconsistent.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The new "total minus excluded" logic partially reimplements the non-culled filter; consider reusing `find_non_culled_hosts` or extracting a shared helper so the business rules for culled/non-culled hosts cannot drift over time.
- When computing `count_map[gid_str] = total_map.get(gid_str, 0) - excluded_map.get(gid_str, 0)`, consider clamping at zero (e.g. `max(..., 0)`) or asserting non-negativity to guard against any edge cases where the two queries become inconsistent.

## Individual Comments

### Comment 1
<location path="lib/host_repository.py" line_range="427" />
<code_context>
-        db.session.query(HostGroupAssoc.group_id, func.count(Host.id).label("host_count"))
-        .join(Host)
+    # Count ALL members per group from hosts_groups only (index-only scan, no join).
+    total_query = (
+        db.session.query(HostGroupAssoc.group_id, func.count().label("total"))
         .filter(HostGroupAssoc.org_id == org_id, HostGroupAssoc.group_id.in_(group_ids))
</code_context>
<issue_to_address>
**issue (complexity):** Consider simplifying the map construction and extracting the staleness/exclusion and two-query strategy into helpers so the main function clearly expresses high-level intent.

You can keep the optimization but local complexity can be reduced and staleness semantics encapsulated a bit better.

### 1. Inline map-merge can be simplified

You don’t need the prefilled `count_map` + loop; you can compute it in one expression and avoid mutation:

```python
total_map = {str(gid): total for gid, total in total_query.all()}
excluded_map = {str(gid): excluded for gid, excluded in excluded_query.all()}

count_map = {
    str(gid): total_map.get(str(gid), 0) - excluded_map.get(str(gid), 0)
    for gid in group_ids
}
return count_map
```

This keeps the “total minus excluded” behavior but makes it easier to see the core logic.

### 2. Encapsulate the staleness / exclusion filter

The custom staleness logic is currently inlined and partially duplicates existing abstractions. You can wrap it in a small helper so that the query reads as “exclude culled hosts” instead of embedding the details:

```python
def _excluded_hosts_filter() -> ColumnElement[bool]:
    staleness_filter = HostStalenessStatesDbFilters()
    return or_(
        staleness_filter.culled(),
        Host.deletion_timestamp.is_(None),
    )
```

Then in `get_host_counts_batch`:

```python
excluded_query = (
    db.session.query(
        HostGroupAssoc.group_id,
        func.count().label("excluded"),
    )
    .join(
        Host,
        and_(
            HostGroupAssoc.host_id == Host.id,
            HostGroupAssoc.org_id == Host.org_id,
        ),
    )
    .filter(
        HostGroupAssoc.org_id == org_id,
        HostGroupAssoc.group_id.in_(group_ids),
        _excluded_hosts_filter(),
    )
    .group_by(HostGroupAssoc.group_id)
)
```

That keeps the optimization (only culled/“excluded” hosts hit the join) but moves the staleness semantics into one place so it’s easier to keep consistent with `find_non_culled_hosts` and to adjust later if the definition changes.

### 3. Optional: encapsulate the two-query strategy in a helper

If you want to further separate domain intent from the query strategy, you can push the “total minus excluded” details into a private helper. `get_host_counts_batch` then just expresses high-level intent:

```python
def _get_non_culled_host_counts_for_groups(
    org_id: str, group_ids: list[str]
) -> dict[str, int]:
    if not group_ids:
        return {}

    total_query = (
        db.session.query(
            HostGroupAssoc.group_id,
            func.count().label("total"),
        )
        .filter(
            HostGroupAssoc.org_id == org_id,
            HostGroupAssoc.group_id.in_(group_ids),
        )
        .group_by(HostGroupAssoc.group_id)
    )
    total_map = {str(gid): total for gid, total in total_query.all()}

    excluded_query = (
        db.session.query(
            HostGroupAssoc.group_id,
            func.count().label("excluded"),
        )
        .join(
            Host,
            and_(
                HostGroupAssoc.host_id == Host.id,
                HostGroupAssoc.org_id == Host.org_id,
            ),
        )
        .filter(
            HostGroupAssoc.org_id == org_id,
            HostGroupAssoc.group_id.in_(group_ids),
            _excluded_hosts_filter(),
        )
        .group_by(HostGroupAssoc.group_id)
    )
    excluded_map = {str(gid): excluded for gid, excluded in excluded_query.all()}

    return {
        str(gid): total_map.get(str(gid), 0) - excluded_map.get(str(gid), 0)
        for gid in group_ids
    }


def get_host_counts_batch(org_id: str, group_ids: list[str]) -> dict[str, int]:
    return _get_non_culled_host_counts_for_groups(org_id, group_ids)
```

Functionality and performance characteristics stay the same, but `get_host_counts_batch` is again “about counts per group,” and the performance/SQL strategy is isolated and easier to reason about or test.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@rodrigonull rodrigonull force-pushed the RHINENG-25330 branch 3 times, most recently from ce3a97a to 94236c7 Compare March 31, 2026 12:12
…s table scan

Replace the single expensive JOIN query with a "total minus excluded"
strategy: count all members from hosts_groups (index-only scan, no
join) then subtract the small number of culled/NULL deletion_timestamp
hosts. For large orgs with 500K+ hosts in a group, this eliminates
the sequential scan on the wide hosts table.
Comment on lines +272 to +282
def _excluded_hosts_filter():
"""Filter matching hosts that should be excluded from active counts.

Matches culled hosts (deletion_timestamp <= now) and hosts with NULL
deletion_timestamp. This is the inverse of the non-culled filter used
by find_non_culled_hosts, kept in sync via HostStalenessStatesDbFilters.
"""
staleness_filter = HostStalenessStatesDbFilters()
return or_(staleness_filter.culled(), Host.deletion_timestamp.is_(None))


Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is now overlap with find_non_culled_hosts. You could change that function to just

return query.filter(not_(_excluded_hosts_filter()))

Comment on lines +445 to +457
# Count EXCLUDED members: culled (deletion_timestamp <= now) OR NULL deletion_timestamp.
# This query joins the hosts table but matches very few rows (typically <1%).
excluded_query = (
db.session.query(HostGroupAssoc.group_id, func.count().label("excluded"))
.join(Host, and_(HostGroupAssoc.host_id == Host.id, HostGroupAssoc.org_id == Host.org_id))
.filter(
HostGroupAssoc.org_id == org_id,
HostGroupAssoc.group_id.in_(group_ids),
_excluded_hosts_filter(),
)
.group_by(HostGroupAssoc.group_id)
)
excluded_map = {str(gid): excluded for gid, excluded in excluded_query.all()}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new version of this function now does 2 queries instead of 1, and I'd expect this second query to do similar .join work to the original. The implementation's a bit more complicated, but it could be worth it if the performance gain is large enough. Have you compared the old and new queries using EXPLAIN ANALYZE or something like that?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants