Skip to content

Conversation

pl
Copy link
Contributor

@pl pl commented Oct 19, 2025

Problem

We join cohortpeople count queries with persons table to validate the team ID, which is super slow for large cohorts.

Changes

Changes the count query to run a separate query to check the team id.

How did you test this code?

Added a test to verify team isolation.

@pl pl requested review from a team and jose-sequeira October 19, 2025 00:33
@posthog-project-board-bot posthog-project-board-bot bot moved this to In Review in Feature Flags Oct 19, 2025
@github-project-automation github-project-automation bot moved this from In Review to Approved in Feature Flags Oct 19, 2025
Base automatically changed from pl/revert_cohort_count to master October 19, 2025 00:41
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

return 0

# Then count cohortpeople (in persons DB, no cross-DB join)
count = CohortPeople.objects.filter(cohort_id=cohort_id).count()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: Missing team_id filter in count query causes incorrect counts in multi-team projects.

When recalculate_cohortpeople runs for a project with multiple teams (environments), it calls this function for each team. The same cohort_id can have CohortPeople entries with different team_id values. Without filtering by team_id, all teams get the same total count instead of team-specific counts.

The ClickHouse table has team_id but the Django model doesn't expose it. Use raw SQL to filter properly:

Suggested change
count = CohortPeople.objects.filter(cohort_id=cohort_id).count()
# Then count cohortpeople filtering by team_id (in persons DB, no cross-DB join)
result = sync_execute(
f"SELECT count() FROM {PERSON_STATIC_COHORT_TABLE} WHERE cohort_id = %(cohort_id)s AND team_id = %(team_id)s",
{"cohort_id": cohort_id, "team_id": team_id}
)
count = result[0][0] if result else 0
Prompt To Fix With AI
This is a comment left during a code review.
Path: posthog/models/cohort/util.py
Line: 428:428

Comment:
**logic:** Missing `team_id` filter in count query causes incorrect counts in multi-team projects.

When `recalculate_cohortpeople` runs for a project with multiple teams (environments), it calls this function for each team. The same `cohort_id` can have CohortPeople entries with different `team_id` values. Without filtering by `team_id`, all teams get the same total count instead of team-specific counts.

The ClickHouse table has `team_id` but the Django model doesn't expose it. Use raw SQL to filter properly:

```suggestion
    # Then count cohortpeople filtering by team_id (in persons DB, no cross-DB join)
    result = sync_execute(
        f"SELECT count() FROM {PERSON_STATIC_COHORT_TABLE} WHERE cohort_id = %(cohort_id)s AND team_id = %(team_id)s",
        {"cohort_id": cohort_id, "team_id": team_id}
    )
    count = result[0][0] if result else 0
```

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +342 to +363
def test_static_cohort_size_validates_team(self):
from posthog.models.cohort.util import get_static_cohort_size

# Create another team in the same organization (different project)
team2 = Team.objects.create(organization=self.organization)

# Create people in both teams
Person.objects.create(team=self.team, distinct_ids=["person1_team1"])
Person.objects.create(team=self.team, distinct_ids=["person2_team1"])
Person.objects.create(team=team2, distinct_ids=["person1_team2"])

# Create a static cohort in team1
cohort = Cohort.objects.create(team=self.team, is_static=True, name="test cohort")
cohort.insert_users_by_list(["person1_team1", "person2_team1"])

# Count should work for the correct team
count_correct = get_static_cohort_size(cohort_id=cohort.pk, team_id=self.team.pk)
assert count_correct == 2

# Count should be 0 for a different team (validates team ownership)
count_wrong_team = get_static_cohort_size(cohort_id=cohort.pk, team_id=team2.pk)
assert count_wrong_team == 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: Test doesn't cover the multi-team-in-same-project scenario where the bug occurs.

The current test validates team isolation across different organizations, but doesn't test the case where multiple teams in the same project share a cohort. In recalculate_cohortpeople, all teams in a project get CohortPeople entries for the same cohort_id but with different team_id values.

Add a test case like:

# Create two teams in the same project
team2 = Team.objects.create(organization=self.organization, project_id=self.team.project_id)

# Add people to the cohort from both teams
cohort.insert_users_by_list(["person1_team1"], team_id=self.team.pk)
cohort.insert_users_by_list(["person1_team2"], team_id=team2.pk)

# Each team should see only their count
assert get_static_cohort_size(cohort_id=cohort.pk, team_id=self.team.pk) == 1
assert get_static_cohort_size(cohort_id=cohort.pk, team_id=team2.pk) == 1
Prompt To Fix With AI
This is a comment left during a code review.
Path: posthog/test/test_cohort_model.py
Line: 342:363

Comment:
**logic:** Test doesn't cover the multi-team-in-same-project scenario where the bug occurs.

The current test validates team isolation across different organizations, but doesn't test the case where multiple teams in the same project share a cohort. In `recalculate_cohortpeople`, all teams in a project get CohortPeople entries for the same `cohort_id` but with different `team_id` values.

Add a test case like:
```python
# Create two teams in the same project
team2 = Team.objects.create(organization=self.organization, project_id=self.team.project_id)

# Add people to the cohort from both teams
cohort.insert_users_by_list(["person1_team1"], team_id=self.team.pk)
cohort.insert_users_by_list(["person1_team2"], team_id=team2.pk)

# Each team should see only their count
assert get_static_cohort_size(cohort_id=cohort.pk, team_id=self.team.pk) == 1
assert get_static_cohort_size(cohort_id=cohort.pk, team_id=team2.pk) == 1
```

How can I resolve this? If you propose a fix, please make it concise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Approved

Development

Successfully merging this pull request may close these issues.

2 participants