
Conversation

@AlexVelezLl AlexVelezLl commented Oct 8, 2025

Summary

  • Adds support for a retry_on argument in the @task decorator to specify a list of non-deterministic exceptions; if a job fails because of one of them, it is retried (see the sketch below).
    • The user won't see the task as failed until it has been retried 3 times.
  • Updates the setupwizard frontend to handle failed tasks.
  • Updates the setupwizard frontend to persist the list of users being imported.
  • Disables the back button on the import users page while users are being imported, to prevent unexpected page layouts.
  • Adds a semaphore on the frontend so that only 3 task creation requests run at a time.
grab.mov
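A minimal sketch of how a task might opt in to retries, assuming the register_task decorator from kolibri.core.tasks.decorators; the queue name, exception list, and task body are illustrative, not the final API:

```python
from django.db.utils import OperationalError
from requests.exceptions import ConnectionError, Timeout

from kolibri.core.tasks.decorators import register_task


# Hypothetical task: retry when the job fails with one of these transient,
# non-deterministic errors instead of surfacing the failure immediately.
@register_task(
    queue="soud_sync",
    retry_on=[OperationalError, ConnectionError, Timeout],
)
def import_user_task(user_id, facility_id):
    ...  # perform the import; a transient failure here triggers a retry
```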

References

Closes #11836.

Reviewer guidance

@AlexVelezLl AlexVelezLl requested a review from rtibbles October 8, 2025 23:22
@github-actions github-actions bot added DEV: backend Python, databases, networking, filesystem... APP: Setup Wizard Re: Setup Wizard (facility import, superuser creation, settings, etc.) DEV: frontend SIZE: medium labels Oct 8, 2025
@AlexVelezLl AlexVelezLl force-pushed the fix-lod-import-multi-users branch from 4e2fbc9 to 236f654 Compare October 8, 2025 23:26

@rtibbles rtibbles self-assigned this Oct 9, 2025

@rtibbles rtibbles left a comment

I think we can maintain the current separation of concerns, and it may be worth the effort of adding a new column to track the retries rather than keeping it in the extra_metadata.

To allow us to migrate the SQLAlchemy table, adding alembic as a dependency feels a bit heavy duty. So perhaps the answer is to clear the jobs table of any finished tasks, then dump the remainder to a temporary CSV, clear the table, recreate, and then reload the data?
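For what it's worth, a rough sketch of that dump-and-reload idea using the stdlib sqlite3 module; the file, table, column, and state names here are assumptions, not the actual Kolibri schema:

```python
import csv
import sqlite3

conn = sqlite3.connect("job_storage.sqlite3")  # assumed path
conn.row_factory = sqlite3.Row

# 1. Clear finished tasks so only rows worth keeping get migrated.
conn.execute("DELETE FROM jobs WHERE state IN ('COMPLETED', 'CANCELED')")

# 2. Dump the remainder to a temporary CSV.
rows = conn.execute("SELECT * FROM jobs").fetchall()
if rows:
    with open("jobs_backup.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(dict(r) for r in rows)

# 3. Drop the table, recreate it with the new column (e.g. error_retries),
#    and reload the CSV rows, letting the new column take its default.
conn.execute("DROP TABLE jobs")
conn.commit()
```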

permission_classes=None,
long_running=False,
status_fn=None,
retry_on=None,
Member

Good job avoiding a classic Python gotcha! (Passing mutable values as default arguments, such as [], is a very common mistake that can cause issues.)
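For readers unfamiliar with the gotcha: a mutable default is created once at function definition time and shared across calls, which is why retry_on=None plus an in-body default is the safe pattern. A tiny illustration with made-up names:

```python
def add_job(job, queue=[]):  # BUG: the same list object is reused on every call
    queue.append(job)
    return queue

add_job("a")  # ["a"]
add_job("b")  # ["a", "b"] - state leaks between unrelated calls


def add_job_fixed(job, queue=None):  # the None-default pattern used in this PR
    if queue is None:
        queue = []
    queue.append(job)
    return queue
```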

total_progress=0,
result=None,
long_running=False,
retry_on=None,
Member

I feel like we don't need to store this in the job object - we're not allowing this to be customized per job, only per task - so I think we can just reference this from the task itself, rather than having to pass it in at job initialization. This also saves us having to coerce the exception classes to import paths.

)
setattr(current_state_tracker, "job", None)

def should_retry(self, exception):
Member

I think I'd rather defer all this logic to the reschedule_finished_job_if_needed method on the storage class, rather than having it in the job class.


def should_retry(self, exception):
retries = self.extra_metadata.get("retries", 0) + 1
self.extra_metadata["retries"] = retries
Member

I am a bit iffy about using extra_metadata for tracking this - I think if we want to hack the existing schema, 'repeat' is probably a better place for this, but I wonder if instead we should add an error_retries column to the job table schema, so that we can set a sensible default for failing tasks and they don't endlessly repeat.

I also think I'd rather have the retry interval defined by the task registration (we could also set a sensible default if retryable exceptions are set).

Comment on lines -145 to -149
from django import db

# Destroy current connections and create new ones:
db.connections.close_all()
db.connections = db.ConnectionHandler()
Member Author

I have removed these db.connections overrides and used patch("django.db.connections") instead. These overrides were having side effects on the job tests that involve multiple threads, and they were messing things up in the teardown process.

However, I'm not sure whether removing these lines might somehow cause a false positive in the test.
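Roughly the pattern being described, as a sketch; the test and helper names are hypothetical:

```python
from unittest.mock import patch

from django.db import ConnectionHandler


def test_job_runs_in_worker_thread():
    # Swap in a fresh connection handler only for the duration of the test,
    # instead of reassigning django.db.connections in place, so teardown sees
    # the original connections untouched.
    with patch("django.db.connections", ConnectionHandler()):
        run_job_in_thread()  # hypothetical helper that exercises the worker
```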

Comment on lines -206 to -208
from django import db

db.connections["default"].connection = None

@AlexVelezLl AlexVelezLl Dec 10, 2025

Same as above: I will instead rely on the django.db.connections patch. But I'm not sure whether this may cause false positives.

@AlexVelezLl AlexVelezLl requested a review from rtibbles December 10, 2025 22:25
@rtibbles rtibbles changed the base branch from develop to release-v0.19.x December 17, 2025 21:39
@AlexVelezLl AlexVelezLl force-pushed the fix-lod-import-multi-users branch 2 times, most recently from cd9821f to 402dadd Compare January 5, 2026 19:29

@rtibbles rtibbles left a comment

The Exception/BaseException validation needs to be cleaned up, as well as the DatabaseLockedError, as I don't think it will catch what we are hoping it will catch.

The PRAGMA setting, if it's not being done for the additional databases, can be deferred to a follow-up.

Import of storage from main is not a blocker, just a thought.

if not isinstance(retry_on, list):
raise TypeError("retry_on must be a list of exceptions")
for item in retry_on:
if not issubclass(item, Exception):
Member

We should change this to BaseException - it's a little uncommon, but sometimes exceptions are subclassed from this rather than the Exception class: https://docs.python.org/3/library/exceptions.html#BaseException
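A quick illustration of why the broader check matters; ShutdownRequested is a made-up example class:

```python
class ShutdownRequested(BaseException):  # subclasses BaseException, not Exception
    pass


issubclass(ShutdownRequested, Exception)      # False - an Exception-only check rejects it
issubclass(ShutdownRequested, BaseException)  # True
issubclass(KeyboardInterrupt, Exception)      # False - same story for this built-in
```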

def set_sqlite_pragmas(self):
"""
Sets the connection PRAGMAs for the sqlalchemy engine stored in self.engine.
Sets the connection PRAGMAs for the sqlite database.
Member

Now that this is managed via Django... I think we should be doing this already, and if we're not doing it for all of the additional DBs, we should be.
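One way to cover every SQLite connection Django opens, including the additional databases, is the connection_created signal; the PRAGMA values here are illustrative, not necessarily what Kolibri sets:

```python
from django.db.backends.signals import connection_created


def apply_sqlite_pragmas(sender, connection, **kwargs):
    # Fires for every new connection, whatever its database alias, so the
    # additional DBs get the same treatment as "default".
    if connection.vendor == "sqlite":
        with connection.cursor() as cursor:
            cursor.execute("PRAGMA journal_mode = WAL;")
            cursor.execute("PRAGMA synchronous = NORMAL;")


connection_created.connect(apply_sqlite_pragmas)
```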

Member Author

Yes! I recall that we were just doing this for the default db; that's why I kept this function here.

Member Author

I think be1438b may solve this.


def _update_job(self, job_id, state=None, **kwargs):
with self.session_scope() as session:
with transaction.atomic(using=self._get_job_database_alias()):
Member

I assume this is needed because transaction.atomic by default only operates on the default database?

Member Author

Yes!
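For context: transaction.atomic() with no arguments only wraps the default alias, hence the explicit using= in the diff above. The alias and function names in this sketch are illustrative:

```python
from django.db import transaction

JOBS_DB_ALIAS = "job_storage"  # hypothetical alias for the jobs database


def mark_job_running(job_id):
    # Without using=..., the atomic block would guard the "default" database
    # while the job row actually lives in a different one.
    with transaction.atomic(using=JOBS_DB_ALIAS):
        ...  # update the job row here
```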

"saved_job": job.to_json(),
}

if orm_job:
Member

Could potentially use update_or_create here - but given that we already know whether the row exists, this seems fine to me.
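For reference, the update_or_create variant would collapse the branch into a single call; the model name here is hypothetical:

```python
def save_orm_job(job):
    # Hypothetical ORMJob model; shown only to illustrate the alternative to
    # the explicit "if orm_job:" branch above.
    orm_job, _created = ORMJob.objects.update_or_create(
        id=job.job_id,
        defaults={"state": job.state, "saved_job": job.to_json()},
    )
    return orm_job
```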

return executor(max_workers=max_workers)


class DatabaseLockedError(OperationalError):
Member

I am not sure when this would ever get raised; we have defined it here, but we never use it.

For this to work, it would have to be raised by the sync task that has it listed as an exception it can retry on. We have some similar logic in our middleware that raises 502s on requests - perhaps we could create a broader context manager that catches OperationalErrors and reraises them as DatabaseLockedErrors if they meet the criterion?
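A rough sketch of that context-manager idea; which OperationalError is meant and the "database is locked" string check are assumptions:

```python
from contextlib import contextmanager

from django.db.utils import OperationalError


class DatabaseLockedError(OperationalError):
    pass


@contextmanager
def reraise_database_locked():
    # Narrow transient "database is locked" failures to a dedicated type, so a
    # task can list only DatabaseLockedError in retry_on.
    try:
        yield
    except OperationalError as e:
        if "database is locked" in str(e):
            raise DatabaseLockedError(str(e)) from e
        raise
```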

Member Author

I was a bit confused when creating this class; I've removed it and used the OperationalError class instead!

raise TypeError("time delay must be a datetime.timedelta object")


def validate_exception(value):
Member

Is this being used? It seems that this validation was happening inline elsewhere? (noting that here BaseException is being used though!)

Member Author

Yes! It is being used here https://github.com/AlexVelezLl/kolibri/blob/402dadd608f01e48679bb4d528389d5ee93553f4/kolibri/core/tasks/storage.py#L517.

I think the inline validation you are talking about is this one https://github.com/AlexVelezLl/kolibri/blob/fix-lod-import-multi-users/kolibri/core/tasks/registry.py#L270, but that one is validating the class; this validate_exception is validating the object.
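In other words, one check guards the registration-time list of classes and the other guards a stored value; the bodies below are illustrative, not the actual implementations:

```python
def validate_exception_class(value):
    # Registration-time check: entries in retry_on must be exception classes.
    if not (isinstance(value, type) and issubclass(value, BaseException)):
        raise TypeError("retry_on entries must be exception classes")


def validate_exception(value):
    # Storage-time check: the stored value must be an exception instance.
    if not isinstance(value, BaseException):
        raise TypeError("value must be an exception instance")


validate_exception_class(OSError)    # OK: a class
validate_exception(OSError("boom"))  # OK: an instance
```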

connection = db_connection()

storage = Storage(connection)
storage = Storage()
Member

I wonder... could we just import the storage object from main here?

Member Author

Seems like a good idea! 😅

self.future_job_mapping = {}

self.storage = Storage(connection)
self.storage = Storage()
Member

Likewise here.

@AlexVelezLl AlexVelezLl requested a review from rtibbles January 5, 2026 21:28
@AlexVelezLl
Member Author

Thanks @rtibbles! I have addressed all your comments!

@rtibbles rtibbles left a comment

All my comments are addressed! Let's get this QAed.

@rtibbles rtibbles dismissed their stale review January 5, 2026 23:07

All comments addressed.

rtibbles commented Jan 5, 2026

@pcenov @radinamatic hopefully the issue has the details needed for replication - this has ended up being a slightly larger refactor, so it would be worth doing some additional smoke tests of async tasks, like content imports, and also checking some different syncs.

There's also a possibility for regression in the Android App, so if we can test the import workflows on Android as well, that would be very helpful.

@pcenov pcenov self-requested a review January 6, 2026 08:59

AlexVelezLl commented Jan 9, 2026

Thanks @pcenov. Apologies, I should have noted this. These concurrent imports may overload the remote server (and potentially the local server), and we have no control over when it becomes overloaded and crashes. We can limit the number of jobs running at the same time, or we can disable the import button when a certain number of users are being imported at once. We can also increase the number of times we try to connect to the server (cc: @rtibbles), but there is still a possibility that any of the servers goes down. This will depend heavily on the system's resources (for example, I was able to import up to 48 users concurrently on my laptop without any errors). The most important update is that even if the import fails, the user can try again, and they will eventually succeed, because these errors are not caused by corrupted data.

In general, a large part of the updates introduced by this PR boils down to: retry as much as possible whenever one of these errors that are beyond our control is thrown, so that we minimize the number of errors the end user sees.
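The throttling piece itself lives in the frontend JavaScript, but the idea is just a counting semaphore around task creation; a conceptual Python sketch of the same pattern, with a placeholder start_import coroutine:

```python
import asyncio


async def import_users(user_ids, start_import):
    # Allow at most 3 task-creation requests in flight at once, mirroring the
    # frontend semaphore described in the summary; start_import stands in for
    # the request that enqueues one import task.
    semaphore = asyncio.Semaphore(3)

    async def guarded(user_id):
        async with semaphore:
            return await start_import(user_id)

    return await asyncio.gather(*(guarded(uid) for uid in user_ids))
```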

AlexVelezLl commented Jan 9, 2026

After chatting with @rtibbles, we will try one additional strategy to minimize the number of errors seen by the user even further. I will push it today before EOD.

@AlexVelezLl AlexVelezLl force-pushed the fix-lod-import-multi-users branch from ae10224 to eeb1d67 Compare January 9, 2026 20:23
@rtibbles
Member

To clarify, this is ready for another round of QA!

pcenov commented Jan 12, 2026

Hi @AlexVelezLl - when I created a brand new Windows server I was able to successfully import a number of learners, and then import the few for which the import was initially not working. Syncing then worked correctly for both my Ubuntu LOD and Android LOD:

Logs:
AndroidLODLogs.zip
UbuntuLODLogs.zip
WindowsServerLogs.zip

At the same time, after additional testing, I found myself in a situation where several learners were stuck in a perpetual loading state, it was no longer possible to import the ones for which the import had failed, and since the Back arrow is now disabled there was no clear way to proceed:

imprt.stuck.mp4

I had to restart the server and then I was able to proceed with the import, but then the syncing was not working correctly even after restarting both the server and the LOD VM:

syncing.not.working.mp4

Logs:

WindowsServerLogsAndDB.zip
UbuntuLODLogsAndDB.zip

I was able to replicate the stoppage of the syncing with both the Android and Mac apps too:

AndroidLogsDB.zip

MacLogs.zip

Additionally, I was able to get into a situation where changes made on the server were being synced correctly to the LOD, but the changes made on the LOD were not synced back to the server:

learner.progress.not.synced.to.the.server.mp4

Syncing is also not working when using my Ubuntu VM as a server:

UbuntuServerLogsDB.zip
AndroidLODLogsDB.zip

@AlexVelezLl AlexVelezLl force-pushed the fix-lod-import-multi-users branch from eeb1d67 to 767dd42 Compare January 13, 2026 14:46
@AlexVelezLl AlexVelezLl force-pushed the fix-lod-import-multi-users branch 2 times, most recently from 7e806e9 to 883e1d1 Compare January 13, 2026 22:14
@AlexVelezLl
Member Author

Hi @pcenov. I have fixed the perpetual loading state. Regarding the syncing errors, these will require some design work to fix. Right now, syncing doesn't work well with LOD devices that have a lot of users (this should also happen in development, but it is more apparent now that we can import more users more easily). I will file a follow-up issue for this tomorrow.

@rtibbles rtibbles left a comment

Just one question left - sorry, forgot to submit this previously!

@register_task(
job_id=SOUD_SYNC_PROCESSING_JOB_ID,
queue=soud_sync_queue,
priority=Priority.HIGH,
Member

Is this intentional? We want this task to be high priority.

Member Author

Ah, apologies, I overlooked this

@AlexVelezLl AlexVelezLl force-pushed the fix-lod-import-multi-users branch from 883e1d1 to ce670d7 Compare January 14, 2026 00:15

@rtibbles rtibbles left a comment

I think this should be ready for more QA!

pcenov commented Jan 14, 2026

Hi @AlexVelezLl - I confirm that I was not able to replicate the error where several learners were stuck in a perpetual loading state, but for some reason the following workflows are no longer working in the latest build here:

  1. LOD > Create a new user account for an existing facility / Import individual user accounts (for a single user import)
create.new.account.mp4

UserIDerrorLogs.zip

  2. On my own > Change facility:
change.facility.mp4

ubuntuLODchangefacilityLogs.zip

I'm replicating the errors on Android too and also when using Ubuntu as a server, so it's not something device specific.

@AlexVelezLl
Member Author

Thanks @pcenov, will take a look

@AlexVelezLl AlexVelezLl force-pushed the fix-lod-import-multi-users branch from d664e86 to fc73510 Compare January 14, 2026 16:30
@AlexVelezLl
Member Author

Just tested it again, and this should be fixed now. Sorry for the inconvenience.

pcenov commented Jan 15, 2026

Hi @AlexVelezLl - I confirm that the above mentioned issues are fixed.

I did stumble on another edge case, which I can file separately if you wish: while importing multiple users I had to stop the server, and when I started it again one of the users remained stuck in a perpetual loading state. Perhaps there should be a cancel option for such cases?

loading.mp4

WinServer.zip
UbuntuLodLogs.zip

Also a reminder to file a separate issue for the syncing issues when there are multiple imported learners on the device. :)

Other than that, everything else looks OK!

@AlexVelezLl
Member Author

Hi @pcenov! Could you please also attach the file job_storage.sqlite3 of the LOD server? Thanks!

pcenov commented Jan 15, 2026

Hi Alex, here you go: job_storage.zip

this.learnersTaskCreationLoading.push(learner.id);
await this.enqueue(() => this.startImport(learner));

this.setTimeout(() => {
Member

Just as a reminder for me, will the setTimeout function properly clean up the callback when it's done?

Member Author

It should, yes. We can also clear the setTimeout object on the component unmount hook to prevent its execution if the component is unmounted.

Member Author

(Just FYI) I pushed a small change to use setTimeout instead of this.setTimeout. This was somehow working fine, but it's better to use the correct syntax!

Member

OK - and since it's setTimeout, not setInterval, the worst that will happen is we get an error after teardown if we don't clean this up?

Development

Successfully merging this pull request may close these issues.

Setup Wizard - Confusing behavior when importing multiple learners
