Retry peeruserimport task on Database or connection errors #13821
base: release-v0.19.x
Conversation
Force-pushed from 4e2fbc9 to 236f654
rtibbles left a comment
I think we can maintain the current separation of concerns, and it may be worth the effort of adding a new column to track the retries rather than keeping it in the extra_metadata.
To allow us to migrate the SQLAlchemy table, adding alembic as a dependency feels a bit heavy-duty. So perhaps the answer is to clear the jobs table of any finished tasks, then dump the remainder to a temporary CSV, clear the table, recreate, and then reload the data?
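A rough sketch of that dump-and-recreate approach (assuming a SQLAlchemy 1.4-style API; the helper name, signature, and CSV filename are illustrative rather than Kolibri's actual code):

```python
import csv

from sqlalchemy import select


def recreate_jobs_table(engine, old_table, new_metadata, finished_states):
    # Hypothetical helper: dump unfinished jobs, rebuild the table, reload.
    with engine.connect() as conn:
        # Drop finished jobs; keep everything still pending or running.
        result = conn.execute(
            select(old_table).where(~old_table.c.state.in_(finished_states))
        )
        columns = list(result.keys())
        remaining = [dict(zip(columns, row)) for row in result]

    # Dump the remainder to a temporary CSV as a safety copy before dropping anything.
    with open("jobs_backup.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        writer.writeheader()
        writer.writerows(remaining)

    # Drop the old table, recreate it from the new metadata (which carries the new
    # retry-tracking column), and reload the saved rows; the new column takes its default.
    old_table.drop(engine)
    new_metadata.create_all(engine)
    new_table = new_metadata.tables[old_table.name]
    with engine.begin() as conn:
        if remaining:
            conn.execute(new_table.insert(), remaining)
```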
    permission_classes=None,
    long_running=False,
    status_fn=None,
    retry_on=None,
Good job avoiding a classic Python gotcha! (Passing mutable values such as [] as default arguments is a very common mistake that can cause issues.)
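For context, a minimal illustration of that gotcha and the usual None-sentinel fix:

```python
def register_bad(retry_on=[]):
    # The same list object is reused across every call, so appends accumulate.
    retry_on.append(ConnectionError)
    return retry_on


def register_good(retry_on=None):
    # A None sentinel gives each call its own fresh list.
    retry_on = [] if retry_on is None else retry_on
    retry_on.append(ConnectionError)
    return retry_on


assert register_bad() == [ConnectionError]
assert register_bad() == [ConnectionError, ConnectionError]  # shared state leaks between calls
assert register_good() == [ConnectionError]
assert register_good() == [ConnectionError]  # independent each time
```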
kolibri/core/tasks/job.py (Outdated)
    total_progress=0,
    result=None,
    long_running=False,
    retry_on=None,
I feel like we don't need to store this in the job object - we're not allowing this to be customized per job, only per task - so I think we can just reference this from the task itself, rather than having to pass it in at job initialization. This also saves us having to coerce the exception classes to import paths.
kolibri/core/tasks/job.py (Outdated)
    )
    setattr(current_state_tracker, "job", None)

    def should_retry(self, exception):
I think I'd rather defer all this logic to the reschedule_finished_job_if_needed method on the storage class, rather than having it in the job class.
kolibri/core/tasks/job.py (Outdated)
    def should_retry(self, exception):
        retries = self.extra_metadata.get("retries", 0) + 1
        self.extra_metadata["retries"] = retries
I am a bit iffy about using extra_metadata for tracking this - I think if we want to hack the existing schema, 'repeat' is probably a better place for this, but I wonder if instead we should extend the job table schema with an error_retries column, so that we can put a sensible default in place for failing tasks and they don't endlessly repeat.
I also think I'd rather have the retry interval defined by the task registration (we could also set a sensible default if retryable exceptions are set).
Force-pushed from 6a0c872 to 55e3dba
Force-pushed from a2765cb to 1a2f204
from django import db

# Destroy current connections and create new ones:
db.connections.close_all()
db.connections = db.ConnectionHandler()
I have removed these db.connections overrides and used patch("django.db.connections") instead. The overrides were having side effects on the job tests that involve multiple threads, and were messing things up in the teardown process.
However, I'm not sure whether removing these lines might somehow cause a false positive in the test.
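A minimal sketch of the patch-based approach, assuming a test that forces the job's database access to fail (the test name and assertion are illustrative):

```python
from unittest.mock import patch

from django.db.utils import OperationalError


def test_job_is_retried_on_operational_error():
    # Patch the connection handler instead of reassigning django.db.connections,
    # so the real handler is restored automatically and other threads are unaffected.
    with patch("django.db.connections") as mock_connections:
        mock_connections.__getitem__.side_effect = OperationalError("database is locked")
        # ... run the task and assert it was rescheduled rather than marked as failed ...
```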
from django import db

db.connections["default"].connection = None
Idem - I will rely on the django.db.connections patch here instead, but I'm not sure whether this may cause false positives.
Force-pushed from cd9821f to 402dadd
rtibbles left a comment
The Exception/BaseException validation needs to be cleaned up, as well as the DatabaseLockedError, as I don't think it will catch what we are hoping it will catch.
The PRAGMA setting, if it's not being done for the additional databases, can be deferred to a follow-up.
Import of storage from main is not a blocker, just a thought.
kolibri/core/tasks/registry.py (Outdated)
if not isinstance(retry_on, list):
    raise TypeError("retry_on must be a list of exceptions")
for item in retry_on:
    if not issubclass(item, Exception):
We should change this to BaseException - it's a little uncommon, but sometimes exceptions are subclassed from this rather than the Exception class: https://docs.python.org/3/library/exceptions.html#BaseException
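A quick illustration of why the broader base class matters here:

```python
# KeyboardInterrupt and SystemExit inherit from BaseException but not Exception,
# so only the broader check accepts such classes as retryable.
assert not issubclass(KeyboardInterrupt, Exception)
assert issubclass(KeyboardInterrupt, BaseException)
```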
kolibri/core/tasks/storage.py (Outdated)
def set_sqlite_pragmas(self):
    """
    Sets the connection PRAGMAs for the sqlalchemy engine stored in self.engine.
    Sets the connection PRAGMAs for the sqlite database.
Now this is managed via Django... I think we should be doing this already, and if we're not doing it for all of the additional DBs, we should be.
Yes! I recall that we were only doing this for the default db, that's why I kept this function here.
I think be1438b may solve this.
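A hedged sketch of how this could be handled on the Django side for every configured database, via the connection_created signal (the PRAGMA values are illustrative, not necessarily the ones Kolibri sets):

```python
from django.db.backends.signals import connection_created
from django.dispatch import receiver


@receiver(connection_created)
def apply_sqlite_pragmas(sender, connection, **kwargs):
    # Fires once per database alias when its connection is opened, so the
    # additional databases get the same treatment as "default".
    if connection.vendor == "sqlite":
        with connection.cursor() as cursor:
            cursor.execute("PRAGMA journal_mode=WAL;")
            cursor.execute("PRAGMA synchronous=NORMAL;")
```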
def _update_job(self, job_id, state=None, **kwargs):
    with self.session_scope() as session:
    with transaction.atomic(using=self._get_job_database_alias()):
I assume this is needed because transaction.atomic by default only operates on the default database?
Yes!
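For reference, the distinction in a nutshell (the alias name below is illustrative):

```python
from django.db import transaction

# Wraps the default database only:
with transaction.atomic():
    ...

# Wraps the database the job storage actually lives in:
with transaction.atomic(using="job_storage"):
    ...
```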
| "saved_job": job.to_json(), | ||
| } | ||
|
|
||
| if orm_job: |
Could potentially use update_or_create here - but given that we already know whether the job exists, this seems fine to me.
kolibri/core/tasks/utils.py (Outdated)
    return executor(max_workers=max_workers)


class DatabaseLockedError(OperationalError):
I am not sure when this would ever get raised - we have defined it here, but we never actually raise it anywhere?
For this to work, it would have to be raised by the sync task that has it as an exception it can retry on? We have some similar logic in our middleware that raises 502s on requests - perhaps we could create a broader context manager that catches OperationalErrors and re-raises them as DatabaseLockedErrors when they meet the criterion?
I was a bit confused when creating this class - I have removed it and used the OperationalError class directly instead!
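For the record, the context-manager idea suggested above could look roughly like this (the "database is locked" string check is illustrative and would need to mirror the middleware's actual criterion):

```python
from contextlib import contextmanager

from django.db.utils import OperationalError


class DatabaseLockedError(OperationalError):
    pass


@contextmanager
def reraise_locked_errors():
    # Narrow generic OperationalErrors down to the retryable "locked" case.
    try:
        yield
    except OperationalError as e:
        if "database is locked" in str(e):
            raise DatabaseLockedError(str(e)) from e
        raise
```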
    raise TypeError("time delay must be a datetime.timedelta object")


def validate_exception(value):
Is this being used? It seems that this validation was happening inline elsewhere? (noting that here BaseException is being used though!)
Yes! It is being used here https://github.com/AlexVelezLl/kolibri/blob/402dadd608f01e48679bb4d528389d5ee93553f4/kolibri/core/tasks/storage.py#L517.
I think the inline validation you are talking about is this one https://github.com/AlexVelezLl/kolibri/blob/fix-lod-import-multi-users/kolibri/core/tasks/registry.py#L270, but that one is validating the class; this validate_exception is validating the object.
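In other words, one check guards the classes passed at registration time and the other guards the exception object actually raised; a minimal illustration (the function bodies here are illustrative, not the PR's code):

```python
def validate_exception_class(value):
    # Registration-time check on the classes passed to retry_on.
    return isinstance(value, type) and issubclass(value, BaseException)


def validate_exception(value):
    # Retry-time check on the exception instance that was actually raised.
    return isinstance(value, BaseException)


assert validate_exception_class(ConnectionError)
assert validate_exception(ConnectionError("connection refused"))
```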
kolibri/core/tasks/worker.py (Outdated)
connection = db_connection()

storage = Storage(connection)
storage = Storage()
I wonder... could we just import the storage object from main here?
Seems like a good idea! 😅
kolibri/core/tasks/worker.py (Outdated)
self.future_job_mapping = {}

self.storage = Storage(connection)
self.storage = Storage()
Likewise here.
Thanks @rtibbles! I have addressed all your comments!
rtibbles left a comment
All my comments are addressed! Let's get this QAed.
@pcenov @radinamatic hopefully the issue has the details needed for replication - this has ended up being a slightly larger refactor, so it's worth doing some additional smoke tests of some async tasks, like content imports, and also checking some different syncs. There's also a possibility of regression in the Android app, so if we can test the import workflows on Android as well, that would be very helpful.
Thanks @pcenov. Apologies, I should have noted this. These concurrent imports may overload the remote server (and potentially the local server), and we have no control over when it becomes overloaded and crashes. We can limit the number of jobs running simultaneously, or we can disable the import button if there are X users being imported at the same time. We can also increase the number of times we try to connect to the server (cc: @rtibbles), but there is still a possibility that any of the servers go down. This will heavily depend on the system's resources (for example, I was able to import up to 48 users concurrently on my laptop without any errors). The most important update is that even if the import fails, the user can try again, and they will eventually succeed, because these errors are not caused by corrupted data. In general, a big part of the updates introduced by this PR boils down to: retry as much as possible when any of these errors that are beyond our control are thrown, so that we minimize the number of errors the end user sees.
After chatting with @rtibbles, we will try one additional strategy to minimize the number of errors seen by the user even further. I will push it today before EOD.
Force-pushed from ae10224 to eeb1d67
To clarify, this is ready for another round of QA!
Hi @AlexVelezLl - when I created a brand new Windows server I was able to successfully import a number of learners, and then import the few for which the import was initially not working. The syncing was then working correctly for both my Ubuntu LOD and Android LOD (logs attached).
At the same time, after additional testing I found myself in a situation where several learners were stuck in a perpetual loading state; it was no longer possible to import the ones for which the import had failed, and since the Back arrow is now disabled there was no clear way to proceed (imprt.stuck.mp4). I had to restart the server and then I was able to proceed with the import, but then the syncing was not working correctly even after restarting both the server and the LOD VM (syncing.not.working.mp4; logs: WindowsServerLogsAndDB.zip).
I was able to replicate the stoppage of the syncing with both the Android and Mac apps too. Additionally, I got into a situation where changes made on the server were being synced correctly to the LOD, but changes made on the LOD were not synced back to the server (learner.progress.not.synced.to.the.server.mp4). Syncing is also not working when using my Ubuntu VM as a server.
Force-pushed from eeb1d67 to 767dd42
Force-pushed from 7e806e9 to 883e1d1
Hi @pcenov. I have fixed the perpetual loading state. Regarding the syncing errors, these will require some design work to get fixed. Right now it doesn't work well with LOD devices that have a lot of users (this should also happen in development, but it is more apparent now that we are able to import more users in an easier way). I will file a follow-up issue for this tomorrow.
rtibbles left a comment
Just one question left - sorry, forgot to submit this previously!
@register_task(
    job_id=SOUD_SYNC_PROCESSING_JOB_ID,
    queue=soud_sync_queue,
    priority=Priority.HIGH,
Is this intentional? We want this task to be high priority.
Ah, apologies, I overlooked this
Force-pushed from 883e1d1 to ce670d7
rtibbles left a comment
I think this should be ready for more QA!
Hi @AlexVelezLl - I confirm that I was not able to replicate the error where several learners were stuck in a perpetual loading state, but for some reason the following workflows are no longer working in the latest build here:
create.new.account.mp4
change.facility.mp4 (logs: ubuntuLODchangefacilityLogs.zip)
I'm replicating the errors on Android too, and also when using Ubuntu as a server, so it's not something device-specific.
Thanks @pcenov, will take a look
Force-pushed from d664e86 to fc73510
Just tested it again, and this should be fixed now. Sorry for the inconvenience.
Hi @AlexVelezLl - I confirm that the above-mentioned issues are fixed. I did stumble on another edge case which I can file separately if you wish: while importing multiple users I had to stop the server, and when I started it again one of the users remained stuck in a perpetual loading state. Perhaps there should be a cancel option for such cases? (loading.mp4; WinServer.zip) Also a reminder to file a separate issue for the syncing issues when there are multiple imported learners on the device. :) Other than that, everything else looks OK!
Hi @pcenov! Could you please also attach the job storage file?
Hi Alex, here you go: job_storage.zip
this.learnersTaskCreationLoading.push(learner.id);
await this.enqueue(() => this.startImport(learner));

this.setTimeout(() => {
Just as a reminder for me - will the setTimeout function properly clean up the callback when it's done?
It should, yes. We can also clear the setTimeout object on the component unmount hook to prevent its execution if the component is unmounted.
(Just FYI) I pushed a small change to use setTimeout instead of this.setTimeout. This was somehow working fine, but better to use the correct syntax!
OK - and since it's setTimeout, not setInterval, the worst that will happen is we get an error after teardown if we don't clean this up?
Force-pushed from 58f9bb4 to 6982ecd
Summary
Adds a retry_on argument in the @task decorator to specify a list of potential non-deterministic exceptions that can be retried if the job failed because of them.
grab.mov
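A hedged usage sketch of the new argument, based on the summary above and the register_task decorator visible in the diff context; the import path, queue name, and exception list are assumptions rather than the PR's exact code:

```python
from django.db.utils import OperationalError
from requests.exceptions import ConnectionError as RequestsConnectionError

from kolibri.core.tasks.decorators import register_task  # import path assumed


@register_task(
    queue="facility_task",  # queue name is illustrative
    retry_on=[OperationalError, RequestsConnectionError],
)
def peeruserimport(**kwargs):
    # If this raises one of the retry_on exceptions, the job is rescheduled
    # instead of being marked as failed outright.
    ...
```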
References
Closes #11836.
Reviewer guidance