Rebackfill the poll stats counter and engagement counter after dedupl… #1319

norkans7 · 2025-12-01T14:51:58Z

…icating the poll stats

Copilot

Pull request overview

This PR adds a new Django migration to deduplicate PollStats entries by questions and then re-backfill the PollStatsCounter and PollEngagementDailyCount tables with the deduplicated data. This migration is a follow-up to migration 0032, which initially backfilled these counter tables.

Key Changes:

- Introduces a deduplication function that identifies FlowResults with multiple associated PollQuestions and consolidates their PollStats
- Re-backfills PollStatsCounter and PollEngagementDailyCount entries after deduplication, using a separate cache key for resumability
- Deletes and re-creates counters for stopped polls to ensure data consistency after deduplication
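
The resumability point can be illustrated with a minimal sketch. This is not the migration's code: a plain dict stands in for the Django cache, and `resumable_backfill` and the key names are hypothetical. It only shows why a fresh cache key matters: a checkpoint left under the old key by migration 0032 would otherwise make the re-backfill skip rows it needs to reprocess.

```python
from typing import Callable, Dict, Iterable, List


def resumable_backfill(
    record_ids: Iterable[int],
    process: Callable[[int], None],
    cache: Dict[str, int],
    cache_key: str,
) -> List[int]:
    """Process ids in ascending order, skipping ids at or below the checkpoint."""
    # The checkpoint is the highest id already processed under this key.
    last_done = cache.get(cache_key, 0)
    processed = []
    for record_id in sorted(record_ids):
        if record_id <= last_done:
            continue  # handled by an earlier, interrupted run
        process(record_id)
        cache[cache_key] = record_id  # advance the checkpoint after each record
        processed.append(record_id)
    return processed


# A checkpoint from an earlier backfill lives under a different (hypothetical) key.
cache = {"backfill_0032": 5}
seen: List[int] = []
first_run = resumable_backfill(range(1, 6), seen.append, cache, "backfill_0033")
second_run = resumable_backfill(range(1, 6), seen.append, cache, "backfill_0033")
```

With the new key the first run processes every id, and a repeated run finds the checkpoint and does nothing.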

Comments suppressed due to low confidence (1)

ureport/stats/migrations/0033_backfill_poll_stats_counters_dedupes.py:41

This assignment to 'stats' is unnecessary as it is redefined before this value is used.

        stats = (

ureport/stats/migrations/0033_backfill_poll_stats_counters_dedupes.py

codecov-commenter · 2025-12-01T14:58:03Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.20%. Comparing base (483da2b) to head (8665fea).

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1319   +/-   ##
=======================================
  Coverage   83.20%   83.20%           
=======================================
  Files          49       49           
  Lines        6102     6102           
=======================================
  Hits         5077     5077           
  Misses       1025     1025

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.

ureport/stats/migrations/0033_backfill_poll_stats_counters_dedupes.py

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 7 comments.

Copilot · 2025-12-02T08:33:03Z

ureport/stats/migrations/0033_backfill_poll_stats_counters_dedupes.py

+    PollStats = apps.get_model("stats", "PollStats")
+
+    GenderSegment = apps.get_model("stats", "GenderSegment")


[nitpick] The backfill_poll_stats_counters function lacks documentation explaining its purpose, parameters, and the complex backfilling logic. Given the function's complexity (handling segments, locations, scopes, caching), a comprehensive docstring would significantly improve maintainability.

Suggested change

-    PollStats = apps.get_model("stats", "PollStats")
-    GenderSegment = apps.get_model("stats", "GenderSegment")
+    """
+    Backfills PollStatsCounter and PollEngagementDailyCount data for existing PollStats entries.
+    This migration function iterates over all PollStats records and reconstructs the associated
+    counter and engagement daily count data, ensuring that statistics are correctly populated
+    for all relevant segments (gender, age, scheme), locations (state, district, ward), and
+    scopes (overall, segmented, location-specific).
+    The function:
+        - Retrieves all necessary models using the provided `apps` registry.
+        - Iterates through PollStats records, handling each according to its segmentation and location.
+        - Handles deduplication and aggregation logic for segments and locations.
+        - Utilizes caching to optimize repeated lookups and avoid redundant database queries.
+        - Ensures that counters and engagement counts are created or updated as needed.
+    Args:
+        apps: The Django app registry for migrations, used to get historical models.
+        schema_editor: The database schema editor (unused, but required by Django migration API).
+    This function is intended to be run as part of a Django data migration and should not be called directly.
+    """
+    PollStats = apps.get_model("stats", "PollStats")

Copilot · 2025-12-02T08:33:03Z

ureport/stats/migrations/0033_backfill_poll_stats_counters_dedupes.py

+                count=stat.count,
+            )
+            engagement_counter_kwargs = dict()
+            if stat.date is not None and stat.date >= (timezone.now() - timedelta(days=400)):


[nitpick] The magic number 400 for the days threshold is hardcoded without explanation. This threshold determines which poll stats get engagement counters created. Consider adding a comment explaining why 400 days was chosen, or defining it as a named constant (e.g., ENGAGEMENT_TRACKING_DAYS = 400) to improve code clarity.
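
A minimal sketch of the named-constant approach suggested here. The constant name comes from the review comment; the helper `is_within_engagement_window` is hypothetical, and `now` is passed in explicitly rather than read from Django's `timezone.now()` so the check can be exercised without a Django setup. The source does not say why 400 days was chosen, so the comment leaves that open.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Name taken from the review suggestion; the rationale for 400 days is not
# stated in this thread and should be documented by the migration's author.
ENGAGEMENT_TRACKING_DAYS = 400


def is_within_engagement_window(stat_date: Optional[datetime], now: datetime) -> bool:
    """Return True when a stat is recent enough to get an engagement counter."""
    if stat_date is None:
        return False
    return stat_date >= now - timedelta(days=ENGAGEMENT_TRACKING_DAYS)


now = datetime(2025, 12, 2, tzinfo=timezone.utc)
recent = is_within_engagement_window(now - timedelta(days=30), now)
stale = is_within_engagement_window(now - timedelta(days=500), now)
```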

Copilot · 2025-12-02T08:33:04Z

ureport/stats/migrations/0033_backfill_poll_stats_counters_dedupes.py

+        poll_engagement_daily_count_ids_count = len(poll_engagement_daily_count_ids)
+
+        for batch in chunk_list(poll_engagement_daily_count_ids, 1000):
+            PollEngagementDailyCount.objects.filter(pk__in=list(batch)).delete()


Same issue as with poll_stats_counters_ids: the batch iterator is consumed when converting to list on line 88, but chunk_list returns iterators that can only be consumed once. The batch should be converted to a list before the loop, not inside the filter() call.

The previous migration (0032) uses batch directly in the filter without wrapping in list(), which is the correct approach.

Suggested change

-            PollEngagementDailyCount.objects.filter(pk__in=list(batch)).delete()
+            PollEngagementDailyCount.objects.filter(pk__in=batch).delete()

Copilot · 2025-12-02T08:33:04Z

ureport/stats/migrations/0033_backfill_poll_stats_counters_dedupes.py

+    pass
+
+
+def dedupe_poll_stats_by_questions(apps, schema_editor):  # pragma: no cover


[nitpick] The function name dedupe_poll_stats_by_questions is misleading. The function doesn't deduplicate by questions - it removes duplicate PollStats entries for a flow result and sets the remaining entry's question to None. A more accurate name would be remove_duplicate_poll_stats_for_flow_results or normalize_poll_stats_questions.
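
The behavior described in this comment can be sketched in plain Python. This is an illustration only, not the migration's ORM code: rows are plain dicts, `dedupe_stats` and the field names are hypothetical, and it assumes that flow results with fewer than two questions are left untouched.

```python
from typing import Dict, List


def dedupe_stats(
    stats: List[dict],
    questions_by_flow_result: Dict[int, List[int]],
) -> List[dict]:
    """Keep one question's stats per multi-question flow result and clear the question."""
    kept = []
    for stat in stats:
        question_ids = questions_by_flow_result.get(stat["flow_result_id"], [])
        if len(question_ids) < 2:
            # Assumption: flow results with a single question are not affected.
            kept.append(dict(stat))
            continue
        if stat["question_id"] == question_ids[0]:
            # Keep only rows tied to the first question and drop the question link.
            kept.append({**stat, "question_id": None})
    return kept


stats = [
    {"id": 1, "flow_result_id": 10, "question_id": 100},
    {"id": 2, "flow_result_id": 10, "question_id": 101},  # duplicate, dropped
    {"id": 3, "flow_result_id": 11, "question_id": 102},
]
deduped = dedupe_stats(stats, {10: [100, 101], 11: [102]})
```

Returning new dicts instead of mutating the input keeps the sketch easy to test; the real migration deletes and updates rows in the database.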

Copilot · 2025-12-02T08:33:04Z

ureport/stats/migrations/0033_backfill_poll_stats_counters_dedupes.py

+    pass
+
+
+def dedupe_poll_stats_by_questions(apps, schema_editor):  # pragma: no cover


[nitpick] The dedupe_poll_stats_by_questions function lacks documentation explaining its purpose and the deduplication logic. Given the complexity of the operation (removing duplicate PollStats entries and nullifying question references), a docstring explaining why this is necessary and what the expected outcome is would improve maintainability.

Suggested change

-def dedupe_poll_stats_by_questions(apps, schema_editor):  # pragma: no cover
+def dedupe_poll_stats_by_questions(apps, schema_editor):  # pragma: no cover
+    """
+    Deduplicate PollStats entries for FlowResults with multiple associated PollQuestions.
+    For each FlowResult that is linked to two or more PollQuestions, this function:
+    1. Deletes all PollStats entries for that FlowResult except the one associated with the first PollQuestion.
+    2. Sets the question field to None for the remaining PollStats entry.
+    This is necessary to ensure that each FlowResult has at most one PollStats entry,
+    and that the PollStats entry is not ambiguously linked to a specific PollQuestion.
+    The expected outcome is that there are no duplicate PollStats per FlowResult,
+    and all remaining PollStats have question=None.
+    """

Copilot · 2025-12-02T08:33:05Z

ureport/stats/migrations/0033_backfill_poll_stats_counters_dedupes.py

+        for batch in chunk_list(poll_stats_counters_ids, 1000):
+            PollStatsCounter.objects.filter(pk__in=list(batch)).delete()


The chunk_list function returns an iterator of iterators, not lists. On line 80, calling list(batch) only converts the outer iterator to a list, but each item in that list is still an iterator that can only be consumed once. This is problematic because the iterator is consumed when creating the list on line 112 (batch_ids = list(batch)), leaving nothing for subsequent iterations.

The same issue exists on line 88. The batch should be converted to a list before being used in the delete query filter.

This differs from the previous migration (0032) where batch is used directly without converting to a list twice, which works correctly.
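
Whether `list(batch)` is a problem depends on what `chunk_list` actually yields, which this thread does not show. The sketch below uses a hypothetical stand-in, `lazy_chunks`, that yields one-shot iterators, to show the general Python behavior the comment is concerned with: a single `list(batch)` call is safe because it consumes the batch exactly once, and only a second read of the same batch comes back empty.

```python
from itertools import islice
from typing import Iterable, Iterator


def lazy_chunks(items: Iterable[int], size: int) -> Iterator[Iterator[int]]:
    """Yield successive batches as one-shot iterators (hypothetical stand-in)."""
    source = iter(items)
    while True:
        chunk = list(islice(source, size))
        if not chunk:
            return
        # Wrapping in iter() mimics a batch that can only be read once.
        yield iter(chunk)


batches = lazy_chunks(range(5), 2)
batch = next(batches)
first_pass = list(batch)   # consumes the batch
second_pass = list(batch)  # the iterator is exhausted by now

# Materializing each batch once makes it safe to reuse, for example in a
# filter and then in a log message.
remaining = [list(b) for b in batches]
```

If a batch has to be read more than once, converting it to a list once and reusing that list avoids the exhaustion issue regardless of how the chunking helper is implemented.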

Copilot · 2025-12-02T08:33:05Z

ureport/stats/migrations/0033_backfill_poll_stats_counters_dedupes.py

+
+    Poll = apps.get_model("polls", "Poll")
+    PollQuestion = apps.get_model("polls", "PollQuestion")
+    Boundary = apps.get_model("locations", "Boundary")


[nitpick] The magic numbers STATE_LEVEL = 1, DISTRICT_LEVEL = 2, and WARD_LEVEL = 3 are defined directly in the migration function. If these constants are defined elsewhere in the codebase (e.g., in a model or constants file), they should be referenced from there to maintain consistency. If the source model or constants change, this migration could have incorrect hardcoded values.

Consider documenting why these values are hardcoded here if they cannot be imported from the model definition.

Suggested change

-    Boundary = apps.get_model("locations", "Boundary")
+    Boundary = apps.get_model("locations", "Boundary")
+    # These values are hardcoded here because Django migrations should not import from application code.
+    # If the values of STATE_LEVEL, DISTRICT_LEVEL, or WARD_LEVEL change in the source models/constants,
+    # this migration may need to be updated accordingly.

Rebackfill the poll stats counter and engagement counter after dedupl…

aaba5f0

…icating the poll stats

Copilot AI review requested due to automatic review settings December 1, 2025 14:51

Copilot started reviewing on behalf of norkans7 December 1, 2025 14:52

Copilot finished reviewing on behalf of norkans7 December 1, 2025 14:55

Copilot AI reviewed Dec 1, 2025

norkans7 and others added 2 commits December 1, 2025 17:04

Apply suggestions from code review

bd23aef

Co-authored-by: Copilot <[email protected]>

Remove unnecessary loop on questions

dea62c7

norkans7 requested a review from Copilot December 1, 2025 15:29

Copilot started reviewing on behalf of norkans7 December 1, 2025 15:29

Copilot finished reviewing on behalf of norkans7 December 1, 2025 15:30

Copilot AI reviewed Dec 1, 2025

norkans7 and others added 2 commits December 1, 2025 18:02

Apply suggestions from code review

7603a34

Co-authored-by: Copilot <[email protected]>

Apply suggestions from code review

3597d4d

Co-authored-by: Copilot <[email protected]>

norkans7 requested a review from Copilot December 2, 2025 08:28

Copilot started reviewing on behalf of norkans7 December 2, 2025 08:29

Copilot finished reviewing on behalf of norkans7 December 2, 2025 08:31

Copilot AI reviewed Dec 2, 2025

Apply copilot code review suggestions

8665fea

norkans7 merged commit 8a24dd7 into main Dec 2, 2025
2 checks passed

3 participants
