Retry peeruserimport task on Database or connection errors #13821
base: release-v0.19.x
Conversation
Force-pushed from 4e2fbc9 to 236f654
rtibbles left a comment
I think we can maintain the current separation of concerns, and it may be worth the effort of adding a new column to track the retries rather than keeping it in the extra_metadata.
To allow us to migrate the SQLAlchemy table, adding alembic as a dependency feels a bit heavy-duty. So perhaps the answer is to clear the jobs table of any finished tasks, then dump the remainder to a temporary CSV, clear the table, recreate, and then reload the data?
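A rough sketch of that dump-and-recreate approach (assuming a SQLAlchemy 1.4-style API; the helper name, signature, and CSV filename are illustrative rather than Kolibri's actual code):

```python
import csv

from sqlalchemy import select


def recreate_jobs_table(engine, old_table, new_metadata, finished_states):
    # Hypothetical helper: dump unfinished jobs, rebuild the table, reload.
    with engine.connect() as conn:
        # Drop finished jobs; keep everything still pending or running.
        result = conn.execute(
            select(old_table).where(~old_table.c.state.in_(finished_states))
        )
        columns = list(result.keys())
        remaining = [dict(zip(columns, row)) for row in result]

    # Dump the remainder to a temporary CSV as a safety copy before dropping anything.
    with open("jobs_backup.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        writer.writeheader()
        writer.writerows(remaining)

    # Drop the old table, recreate it from the new metadata (which carries the new
    # retry-tracking column), and reload the saved rows; the new column takes its default.
    old_table.drop(engine)
    new_metadata.create_all(engine)
    new_table = new_metadata.tables[old_table.name]
    with engine.begin() as conn:
        if remaining:
            conn.execute(new_table.insert(), remaining)
```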
    permission_classes=None,
    long_running=False,
    status_fn=None,
    retry_on=None,
Good job avoiding a classic Python gotcha! (Passing mutable values such as [] as default arguments is a very common mistake that can cause issues.)
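For context, a minimal illustration of that gotcha and the usual None-sentinel fix:

```python
def register_bad(retry_on=[]):
    # The same list object is reused across every call, so appends accumulate.
    retry_on.append(ConnectionError)
    return retry_on


def register_good(retry_on=None):
    # A None sentinel gives each call its own fresh list.
    retry_on = [] if retry_on is None else retry_on
    retry_on.append(ConnectionError)
    return retry_on


assert register_bad() == [ConnectionError]
assert register_bad() == [ConnectionError, ConnectionError]  # shared state leaks between calls
assert register_good() == [ConnectionError]
assert register_good() == [ConnectionError]  # independent each time
```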
kolibri/core/tasks/job.py (Outdated)
    total_progress=0,
    result=None,
    long_running=False,
    retry_on=None,
I feel like we don't need to store this in the job object - we're not allowing this to be customized per job, only per task - so I think we can just reference this from the task itself, rather than having to pass it in at job initialization. This also saves us having to coerce the exception classes to import paths.
kolibri/core/tasks/job.py (Outdated)
    )
    setattr(current_state_tracker, "job", None)

    def should_retry(self, exception):
I think I'd rather defer all this logic to the reschedule_finished_job_if_needed method on the storage class, rather than having it in the job class.
kolibri/core/tasks/job.py (Outdated)
    def should_retry(self, exception):
        retries = self.extra_metadata.get("retries", 0) + 1
        self.extra_metadata["retries"] = retries
I am a bit iffy about using extra_metadata for tracking this - I think if we want to hack the existing schema, 'repeat' is probably a better place for this, but I wonder if instead we should extend the job table schema with an error_retries column, so that we can put a sensible default in place for failing tasks and they don't endlessly repeat.
I also think I'd rather have the retry interval defined by the task registration (we could also set a sensible default if retryable exceptions are set).
Force-pushed from 6a0c872 to 55e3dba
Force-pushed from a2765cb to 1a2f204
from django import db

# Destroy current connections and create new ones:
db.connections.close_all()
db.connections = db.ConnectionHandler()
I have removed these db.connections overrides and used patch("django.db.connections") instead. The overrides were having side effects on the job tests that involve multiple threads, and were messing things up in the teardown process.
However, I'm not sure whether removing these lines might somehow cause a false positive in the test.
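A minimal sketch of the patch-based approach, assuming a test that forces the job's database access to fail (the test name and assertion are illustrative):

```python
from unittest.mock import patch

from django.db.utils import OperationalError


def test_job_is_retried_on_operational_error():
    # Patch the connection handler instead of reassigning django.db.connections,
    # so the real handler is restored automatically and other threads are unaffected.
    with patch("django.db.connections") as mock_connections:
        mock_connections.__getitem__.side_effect = OperationalError("database is locked")
        # ... run the task and assert it was rescheduled rather than marked as failed ...
```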
from django import db

db.connections["default"].connection = None
Idem - I will rely on the django.db.connections patch here instead, but I'm not sure whether this may cause false positives.
Force-pushed from cd9821f to 402dadd
rtibbles left a comment
The Exception/BaseException validation needs to be cleaned up, as well as the DatabaseLockedError, as I don't think it will catch what we are hoping it will catch.
The PRAGMA setting, if it's not being done for the additional databases, can be deferred to a follow-up.
Import of storage from main is not a blocker, just a thought.
kolibri/core/tasks/registry.py (Outdated)
if not isinstance(retry_on, list):
    raise TypeError("retry_on must be a list of exceptions")
for item in retry_on:
    if not issubclass(item, Exception):
We should change this to BaseException - it's a little uncommon, but sometimes exceptions are subclassed from this rather than the Exception class: https://docs.python.org/3/library/exceptions.html#BaseException
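A quick illustration of why the broader base class matters here:

```python
# KeyboardInterrupt and SystemExit inherit from BaseException but not Exception,
# so only the broader check accepts such classes as retryable.
assert not issubclass(KeyboardInterrupt, Exception)
assert issubclass(KeyboardInterrupt, BaseException)
```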
kolibri/core/tasks/storage.py (Outdated)
def set_sqlite_pragmas(self):
    """
    Sets the connection PRAGMAs for the sqlalchemy engine stored in self.engine.
    Sets the connection PRAGMAs for the sqlite database.
Now this is managed via Django... I think we should be doing this already, and if we're not doing it for all of the additional DBs, we should be.
Yes! I recall that we were only doing this for the default db, that's why I kept this function here.
I think be1438b may solve this.
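A hedged sketch of how this could be handled on the Django side for every configured database, via the connection_created signal (the PRAGMA values are illustrative, not necessarily the ones Kolibri sets):

```python
from django.db.backends.signals import connection_created
from django.dispatch import receiver


@receiver(connection_created)
def apply_sqlite_pragmas(sender, connection, **kwargs):
    # Fires once per database alias when its connection is opened, so the
    # additional databases get the same treatment as "default".
    if connection.vendor == "sqlite":
        with connection.cursor() as cursor:
            cursor.execute("PRAGMA journal_mode=WAL;")
            cursor.execute("PRAGMA synchronous=NORMAL;")
```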
def _update_job(self, job_id, state=None, **kwargs):
    with self.session_scope() as session:
    with transaction.atomic(using=self._get_job_database_alias()):
I assume this is needed because transaction.atomic by default only operates on the default database?
Yes!
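For reference, the distinction in a nutshell (the alias name below is illustrative):

```python
from django.db import transaction

# Wraps the default database only:
with transaction.atomic():
    ...

# Wraps the database the job storage actually lives in:
with transaction.atomic(using="job_storage"):
    ...
```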
| "saved_job": job.to_json(), | ||
| } | ||
|
|
||
| if orm_job: |
Could potentially use update_or_create here - but given that we already know whether the job exists, this seems fine to me.
kolibri/core/tasks/utils.py (Outdated)
    return executor(max_workers=max_workers)


class DatabaseLockedError(OperationalError):
I am not sure when this would ever get raised - we have defined it here, but we never actually raise it anywhere?
For this to work, it would have to be raised by the sync task that has it as an exception it can retry on? We have some similar logic in our middleware that raises 502s on requests - perhaps we could create a broader context manager that catches OperationalErrors and re-raises them as DatabaseLockedErrors when they meet the criterion?
I was a bit confused when creating this class - I have removed it and used the OperationalError class directly instead!
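For the record, the context-manager idea suggested above could look roughly like this (the "database is locked" string check is illustrative and would need to mirror the middleware's actual criterion):

```python
from contextlib import contextmanager

from django.db.utils import OperationalError


class DatabaseLockedError(OperationalError):
    pass


@contextmanager
def reraise_locked_errors():
    # Narrow generic OperationalErrors down to the retryable "locked" case.
    try:
        yield
    except OperationalError as e:
        if "database is locked" in str(e):
            raise DatabaseLockedError(str(e)) from e
        raise
```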
    raise TypeError("time delay must be a datetime.timedelta object")


def validate_exception(value):
Is this being used? It seems that this validation was happening inline elsewhere? (noting that here BaseException is being used though!)
Yes! It is being used here https://github.com/AlexVelezLl/kolibri/blob/402dadd608f01e48679bb4d528389d5ee93553f4/kolibri/core/tasks/storage.py#L517.
I think the inline validation you are talking about is this one https://github.com/AlexVelezLl/kolibri/blob/fix-lod-import-multi-users/kolibri/core/tasks/registry.py#L270, but that one is validating the class; this validate_exception is validating the object.
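In other words, one check guards the classes passed at registration time and the other guards the exception object actually raised; a minimal illustration (the function bodies here are illustrative, not the PR's code):

```python
def validate_exception_class(value):
    # Registration-time check on the classes passed to retry_on.
    return isinstance(value, type) and issubclass(value, BaseException)


def validate_exception(value):
    # Retry-time check on the exception instance that was actually raised.
    return isinstance(value, BaseException)


assert validate_exception_class(ConnectionError)
assert validate_exception(ConnectionError("connection refused"))
```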
kolibri/core/tasks/worker.py (Outdated)
connection = db_connection()

storage = Storage(connection)
storage = Storage()
I wonder... could we just import the storage object from main here?
Seems like a good idea! 😅
kolibri/core/tasks/worker.py (Outdated)
self.future_job_mapping = {}

self.storage = Storage(connection)
self.storage = Storage()
Likewise here.
Thanks @rtibbles! I have addressed all your comments!
rtibbles left a comment
All my comments are addressed! Let's get this QAed.
@pcenov @radinamatic hopefully the issue has the details needed for replication - this has ended up being a slightly larger refactor, so it's worth doing some additional smoke tests of some async tasks, like content imports, and also checking some different syncs. There's also a possibility of regression in the Android app, so if we can test the import workflows on Android as well, that would be very helpful.
Thanks @pcenov. Apologies, I should have noted this. These concurrent imports may overload the remote server (and potentially the local server), and we have no control over when it becomes overloaded and crashes. We can limit the number of jobs running simultaneously, or we can disable the import button if there are X users being imported at the same time. We can also increase the number of times we try to connect to the server (cc: @rtibbles), but there is still a possibility that any of the servers go down. This will heavily depend on the system's resources (for example, I was able to import up to 48 users concurrently on my laptop without any errors). The most important update is that even if the import fails, the user can try again, and they will eventually succeed, because these errors are not caused by corrupted data. In general, a big part of the updates introduced by this PR boils down to: retry as much as possible when any of these errors that are beyond our control are thrown, so that we minimize the number of errors the end user sees.
After chatting with @rtibbles, we will try one additional strategy to minimize the number of errors seen by the user even further. I will push it today before EOD.
Force-pushed from ae10224 to eeb1d67
To clarify, this is ready for another round of QA!
Hi @AlexVelezLl - when I created a brand new Windows server I was able to successfully import a number of learners, and then import the few for which the import was initially not working. The syncing was then working correctly for both my Ubuntu LOD and Android LOD (logs attached).
At the same time, after additional testing I found myself in a situation where several learners were stuck in a perpetual loading state; it was no longer possible to import the ones for which the import had failed, and since the Back arrow is now disabled there was no clear way to proceed (imprt.stuck.mp4). I had to restart the server and then I was able to proceed with the import, but then the syncing was not working correctly even after restarting both the server and the LOD VM (syncing.not.working.mp4; logs: WindowsServerLogsAndDB.zip).
I was able to replicate the stoppage of the syncing with both the Android and Mac apps too. Additionally, I got into a situation where changes made on the server were being synced correctly to the LOD, but changes made on the LOD were not synced back to the server (learner.progress.not.synced.to.the.server.mp4). Syncing is also not working when using my Ubuntu VM as a server.
Force-pushed from eeb1d67 to 767dd42
Force-pushed from 7e806e9 to 883e1d1
Hi @pcenov. I have fixed the perpetual loading state. Regarding the syncing errors, these will require some design work to get fixed. Right now it doesn't work well with LOD devices that have a lot of users (this should also happen in development, but it is more apparent now that we are able to import more users in an easier way). I will file a follow-up issue for this tomorrow.
rtibbles left a comment
Just one question left - sorry, forgot to submit this previously!
@register_task(
    job_id=SOUD_SYNC_PROCESSING_JOB_ID,
    queue=soud_sync_queue,
    priority=Priority.HIGH,
Is this intentional? We want this task to be high priority.
Ah, apologies, I overlooked this
Force-pushed from 883e1d1 to ce670d7
rtibbles left a comment
I think this should be ready for more QA!
Hi @AlexVelezLl - I confirm that I was not able to replicate the error where several learners were stuck in a perpetual loading state, but for some reason the following workflows are no longer working in the latest build here:
create.new.account.mp4
change.facility.mp4 (logs: ubuntuLODchangefacilityLogs.zip)
I'm replicating the errors on Android too, and also when using Ubuntu as a server, so it's not something device-specific.
Thanks @pcenov, will take a look
Force-pushed from d664e86 to fc73510
Just tested it again, and this should be fixed now. Sorry for the inconvenience.
Hi @AlexVelezLl - I confirm that the above-mentioned issues are fixed. I did stumble on another edge case which I can file separately if you wish: while importing multiple users I had to stop the server, and when I started it again one of the users remained stuck in a perpetual loading state. Perhaps there should be a cancel option for such cases? (loading.mp4; WinServer.zip) Also a reminder to file a separate issue for the syncing issues when there are multiple imported learners on the device. :) Other than that, everything else looks OK!
Hi @pcenov! Could you please also attach the job storage file?
Hi Alex, here you go: job_storage.zip
this.learnersTaskCreationLoading.push(learner.id);
await this.enqueue(() => this.startImport(learner));

this.setTimeout(() => {
Just as a reminder for me - will the setTimeout function properly clean up the callback when it's done?
It should, yes. We can also clear the setTimeout object on the component unmount hook to prevent its execution if the component is unmounted.
(Just FYI) I pushed a small change to use setTimeout instead of this.setTimeout. This was somehow working fine, but better to use the correct syntax!
OK - and since it's setTimeout, not setInterval, the worst that will happen is we get an error after teardown if we don't clean this up?
Force-pushed from 58f9bb4 to 6982ecd
Summary
Adds a retry_on argument in the @task decorator to specify a list of potential non-deterministic exceptions that can be retried if the job failed because of them.
grab.mov
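A hedged usage sketch of the new argument, based on the summary above and the register_task decorator visible in the diff context; the import path, queue name, and exception list are assumptions rather than the PR's exact code:

```python
from django.db.utils import OperationalError
from requests.exceptions import ConnectionError as RequestsConnectionError

from kolibri.core.tasks.decorators import register_task  # import path assumed


@register_task(
    queue="facility_task",  # queue name is illustrative
    retry_on=[OperationalError, RequestsConnectionError],
)
def peeruserimport(**kwargs):
    # If this raises one of the retry_on exceptions, the job is rescheduled
    # instead of being marked as failed outright.
    ...
```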
References
Closes #11836.
Reviewer guidance