fix: run cohortpeople count query without joins #39925
base: master
2 files reviewed, 2 comments
return 0

# Then count cohortpeople (in persons DB, no cross-DB join)
count = CohortPeople.objects.filter(cohort_id=cohort_id).count()
**logic:** Missing `team_id` filter in count query causes incorrect counts in multi-team projects.
When `recalculate_cohortpeople` runs for a project with multiple teams (environments), it calls this function for each team. The same `cohort_id` can have CohortPeople entries with different `team_id` values. Without filtering by `team_id`, all teams get the same total count instead of team-specific counts.
The ClickHouse table has `team_id` but the Django model doesn't expose it. Use raw SQL to filter properly:
- count = CohortPeople.objects.filter(cohort_id=cohort_id).count()
+ # Then count cohortpeople filtering by team_id (in persons DB, no cross-DB join)
+ result = sync_execute(
+     f"SELECT count() FROM {PERSON_STATIC_COHORT_TABLE} WHERE cohort_id = %(cohort_id)s AND team_id = %(team_id)s",
+     {"cohort_id": cohort_id, "team_id": team_id}
+ )
+ count = result[0][0] if result else 0
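The effect described in this comment can be reproduced in isolation. The following is a minimal sketch, assuming a simplified `cohortpeople` table in an in-memory SQLite database; the schema, the sample rows, and the helper names `count_by_cohort` and `count_by_cohort_and_team` are hypothetical and are not PostHog code.

```python
import sqlite3

# Hypothetical stand-in for the cohortpeople table; the real schema may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cohortpeople (cohort_id INTEGER, team_id INTEGER, person_id INTEGER)"
)
conn.executemany(
    "INSERT INTO cohortpeople VALUES (?, ?, ?)",
    # Cohort 7 has rows for two teams, as in the scenario the comment describes.
    [(7, 1, 101), (7, 1, 102), (7, 2, 201)],
)


def count_by_cohort(cohort_id: int) -> int:
    """Count rows by cohort_id only, so every team sees the same total."""
    row = conn.execute(
        "SELECT count(*) FROM cohortpeople WHERE cohort_id = ?",
        (cohort_id,),
    ).fetchone()
    return row[0]


def count_by_cohort_and_team(cohort_id: int, team_id: int) -> int:
    """Count rows by cohort_id and team_id, giving a per-team number."""
    row = conn.execute(
        "SELECT count(*) FROM cohortpeople WHERE cohort_id = ? AND team_id = ?",
        (cohort_id, team_id),
    ).fetchone()
    return row[0]
```

With these rows, the unfiltered count is the same no matter which team asks, while the filtered count differs per team.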
Path: posthog/models/cohort/util.py
Line: 428:428
def test_static_cohort_size_validates_team(self):
    from posthog.models.cohort.util import get_static_cohort_size

    # Create another team in the same organization (different project)
    team2 = Team.objects.create(organization=self.organization)

    # Create people in both teams
    Person.objects.create(team=self.team, distinct_ids=["person1_team1"])
    Person.objects.create(team=self.team, distinct_ids=["person2_team1"])
    Person.objects.create(team=team2, distinct_ids=["person1_team2"])

    # Create a static cohort in team1
    cohort = Cohort.objects.create(team=self.team, is_static=True, name="test cohort")
    cohort.insert_users_by_list(["person1_team1", "person2_team1"])

    # Count should work for the correct team
    count_correct = get_static_cohort_size(cohort_id=cohort.pk, team_id=self.team.pk)
    assert count_correct == 2

    # Count should be 0 for a different team (validates team ownership)
    count_wrong_team = get_static_cohort_size(cohort_id=cohort.pk, team_id=team2.pk)
    assert count_wrong_team == 0
**logic:** Test doesn't cover the multi-team-in-same-project scenario where the bug occurs.
The current test validates team isolation across different projects (the second team is created in the same organization but in its own project), but doesn't test the case where multiple teams in the same project share a cohort. In `recalculate_cohortpeople`, all teams in a project get CohortPeople entries for the same `cohort_id` but with different `team_id` values.
Add a test case like:
# Create two teams in the same project
team2 = Team.objects.create(organization=self.organization, project_id=self.team.project_id)
# Add people to the cohort from both teams
cohort.insert_users_by_list(["person1_team1"], team_id=self.team.pk)
cohort.insert_users_by_list(["person1_team2"], team_id=team2.pk)
# Each team should see only their count
assert get_static_cohort_size(cohort_id=cohort.pk, team_id=self.team.pk) == 1
assert get_static_cohort_size(cohort_id=cohort.pk, team_id=team2.pk) == 1
Path: posthog/test/test_cohort_model.py
Line: 342:363
Problem
The cohortpeople count query joins against the persons table to validate the team ID, which is very slow for large cohorts.
Changes
Runs the team ID check as a separate query, so the cohortpeople count itself runs without a join.
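The two-step shape of this change can be sketched outside Django. This is an illustration only, assuming the separate query checks that the cohort belongs to the given team; the `cohort` and `cohortpeople` tables, the `make_db` helper, and the `static_cohort_size` function are hypothetical, and a single in-memory SQLite connection stands in for what are separate databases in the PR.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a tiny hypothetical dataset: cohort 7 belongs to team 1."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cohort (id INTEGER PRIMARY KEY, team_id INTEGER)")
    conn.execute("CREATE TABLE cohortpeople (cohort_id INTEGER, person_id INTEGER)")
    conn.execute("INSERT INTO cohort VALUES (7, 1)")
    conn.executemany(
        "INSERT INTO cohortpeople VALUES (?, ?)",
        [(7, 101), (7, 102)],
    )
    return conn


def static_cohort_size(conn: sqlite3.Connection, cohort_id: int, team_id: int) -> int:
    # Step 1: small lookup that validates team ownership of the cohort.
    owned = conn.execute(
        "SELECT 1 FROM cohort WHERE id = ? AND team_id = ?",
        (cohort_id, team_id),
    ).fetchone()
    if owned is None:
        return 0

    # Step 2: count membership rows on their own, with no join.
    row = conn.execute(
        "SELECT count(*) FROM cohortpeople WHERE cohort_id = ?",
        (cohort_id,),
    ).fetchone()
    return row[0]
```

The ownership check touches a single cohort row, so the expensive part (counting membership rows) no longer depends on a join against the persons table.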
How did you test this code?
Added a test to verify team isolation.