-
Notifications
You must be signed in to change notification settings - Fork 37
Rebackfill the poll stats counter and engagement counter after dedupl… #1319
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…icating the poll stats
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds a new Django migration to deduplicate PollStats entries by questions and then re-backfill the PollStatsCounter and PollEngagementDailyCount tables with the deduplicated data. This migration is a follow-up to migration 0032, which initially backfilled these counter tables.
Key Changes:
- Introduces a deduplication function that identifies FlowResults with multiple associated PollQuestions and consolidates their PollStats
- Re-backfills PollStatsCounter and PollEngagementDailyCount entries after deduplication, using a separate cache key for resumability
- Deletes and re-creates counters for stopped polls to ensure data consistency after deduplication
Comments suppressed due to low confidence (1)
ureport/stats/migrations/0033_backfill_poll_stats_counters_dedupes.py:41
- This assignment to 'stats' is unnecessary as it is redefined before this value is used.
stats = (
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
ureport/stats/migrations/0033_backfill_poll_stats_counters_dedupes.py
Outdated
Show resolved
Hide resolved
ureport/stats/migrations/0033_backfill_poll_stats_counters_dedupes.py
Outdated
Show resolved
Hide resolved
ureport/stats/migrations/0033_backfill_poll_stats_counters_dedupes.py
Outdated
Show resolved
Hide resolved
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1319 +/- ##
=======================================
Coverage 83.20% 83.20%
=======================================
Files 49 49
Lines 6102 6102
=======================================
Hits 5077 5077
Misses 1025 1025 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Co-authored-by: Copilot <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
ureport/stats/migrations/0033_backfill_poll_stats_counters_dedupes.py
Outdated
Show resolved
Hide resolved
ureport/stats/migrations/0033_backfill_poll_stats_counters_dedupes.py
Outdated
Show resolved
Hide resolved
ureport/stats/migrations/0033_backfill_poll_stats_counters_dedupes.py
Outdated
Show resolved
Hide resolved
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 7 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| PollStats = apps.get_model("stats", "PollStats") | ||
|
|
||
| GenderSegment = apps.get_model("stats", "GenderSegment") |
Copilot
AI
Dec 2, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The backfill_poll_stats_counters function lacks documentation explaining its purpose, parameters, and the complex backfilling logic. Given the function's complexity (handling segments, locations, scopes, caching), a comprehensive docstring would significantly improve maintainability.
| PollStats = apps.get_model("stats", "PollStats") | |
| GenderSegment = apps.get_model("stats", "GenderSegment") | |
| """ | |
| Backfills PollStatsCounter and PollEngagementDailyCount data for existing PollStats entries. | |
| This migration function iterates over all PollStats records and reconstructs the associated | |
| counter and engagement daily count data, ensuring that statistics are correctly populated | |
| for all relevant segments (gender, age, scheme), locations (state, district, ward), and | |
| scopes (overall, segmented, location-specific). | |
| The function: | |
| - Retrieves all necessary models using the provided `apps` registry. | |
| - Iterates through PollStats records, handling each according to its segmentation and location. | |
| - Handles deduplication and aggregation logic for segments and locations. | |
| - Utilizes caching to optimize repeated lookups and avoid redundant database queries. | |
| - Ensures that counters and engagement counts are created or updated as needed. | |
| Args: | |
| apps: The Django app registry for migrations, used to get historical models. | |
| schema_editor: The database schema editor (unused, but required by Django migration API). | |
| This function is intended to be run as part of a Django data migration and should not be called directly. | |
| """ | |
| PollStats = apps.get_model("stats", "PollStats") |
| count=stat.count, | ||
| ) | ||
| engagement_counter_kwargs = dict() | ||
| if stat.date is not None and stat.date >= (timezone.now() - timedelta(days=400)): |
Copilot
AI
Dec 2, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The magic number 400 for the days threshold is hardcoded without explanation. This threshold determines which poll stats get engagement counters created. Consider adding a comment explaining why 400 days was chosen, or defining it as a named constant (e.g., ENGAGEMENT_TRACKING_DAYS = 400) to improve code clarity.
| poll_engagement_daily_count_ids_count = len(poll_engagement_daily_count_ids) | ||
|
|
||
| for batch in chunk_list(poll_engagement_daily_count_ids, 1000): | ||
| PollEngagementDailyCount.objects.filter(pk__in=list(batch)).delete() |
Copilot
AI
Dec 2, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same issue as with poll_stats_counters_ids: the batch iterator is consumed when converting to list on line 88, but chunk_list returns iterators that can only be consumed once. The batch should be converted to a list before the loop, not inside the filter() call.
The previous migration (0032) uses batch directly in the filter without wrapping in list(), which is the correct approach.
| PollEngagementDailyCount.objects.filter(pk__in=list(batch)).delete() | |
| PollEngagementDailyCount.objects.filter(pk__in=batch).delete() |
| pass | ||
|
|
||
|
|
||
| def dedupe_poll_stats_by_questions(apps, schema_editor): # pragma: no cover |
Copilot
AI
Dec 2, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The function name dedupe_poll_stats_by_questions is misleading. The function doesn't deduplicate by questions - it removes duplicate PollStats entries for a flow result and sets the remaining entry's question to None. A more accurate name would be remove_duplicate_poll_stats_for_flow_results or normalize_poll_stats_questions.
| pass | ||
|
|
||
|
|
||
| def dedupe_poll_stats_by_questions(apps, schema_editor): # pragma: no cover |
Copilot
AI
Dec 2, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The dedupe_poll_stats_by_questions function lacks documentation explaining its purpose and the deduplication logic. Given the complexity of the operation (removing duplicate PollStats entries and nullifying question references), a docstring explaining why this is necessary and what the expected outcome is would improve maintainability.
| def dedupe_poll_stats_by_questions(apps, schema_editor): # pragma: no cover | |
| def dedupe_poll_stats_by_questions(apps, schema_editor): # pragma: no cover | |
| """ | |
| Deduplicate PollStats entries for FlowResults with multiple associated PollQuestions. | |
| For each FlowResult that is linked to two or more PollQuestions, this function: | |
| 1. Deletes all PollStats entries for that FlowResult except the one associated with the first PollQuestion. | |
| 2. Sets the question field to None for the remaining PollStats entry. | |
| This is necessary to ensure that each FlowResult has at most one PollStats entry, | |
| and that the PollStats entry is not ambiguously linked to a specific PollQuestion. | |
| The expected outcome is that there are no duplicate PollStats per FlowResult, | |
| and all remaining PollStats have question=None. | |
| """ |
| for batch in chunk_list(poll_stats_counters_ids, 1000): | ||
| PollStatsCounter.objects.filter(pk__in=list(batch)).delete() |
Copilot
AI
Dec 2, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The chunk_list function returns an iterator of iterators, not lists. On line 80, calling list(batch) only converts the outer iterator to a list, but each item in that list is still an iterator that can only be consumed once. This is problematic because the iterator is consumed when creating the list on line 112 (batch_ids = list(batch)), leaving nothing for subsequent iterations.
The same issue exists on line 88. The batch should be converted to a list before being used in the delete query filter.
This differs from the previous migration (0032) where batch is used directly without converting to a list twice, which works correctly.
|
|
||
| Poll = apps.get_model("polls", "Poll") | ||
| PollQuestion = apps.get_model("polls", "PollQuestion") | ||
| Boundary = apps.get_model("locations", "Boundary") |
Copilot
AI
Dec 2, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The magic numbers STATE_LEVEL = 1, DISTRICT_LEVEL = 2, and WARD_LEVEL = 3 are defined directly in the migration function. If these constants are defined elsewhere in the codebase (e.g., in a model or constants file), they should be referenced from there to maintain consistency. If the source model or constants change, this migration could have incorrect hardcoded values.
Consider documenting why these values are hardcoded here if they cannot be imported from the model definition.
| Boundary = apps.get_model("locations", "Boundary") | |
| Boundary = apps.get_model("locations", "Boundary") | |
| # These values are hardcoded here because Django migrations should not import from application code. | |
| # If the values of STATE_LEVEL, DISTRICT_LEVEL, or WARD_LEVEL change in the source models/constants, | |
| # this migration may need to be updated accordingly. |
…icating the poll stats