[13.0] Support multi-nodes with lock on jobrunner #256
Conversation
Tested only locally for now, but I'm taking reviews.
Hi Guewen, thanks for tackling this topic. So if I understand correctly, if there are multiple databases we could in theory have some databases handled by one runner and others by another runner?
Yes, that's a possible scenario. If 2 jobrunners are started at the same time for the same set of databases, they'll each race for the lock and could each acquire only some of them.
If there is more than one active jobrunner (even on different databases), it could overload the system because each will have its own
I was wondering if we could acquire only one advisory lock on the
We run a multi node setup with a similar patch that applies an advisory lock to prevent duplicate jobrunners. Works well.
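The advisory-lock idea mentioned here can be sketched as follows. This is an illustrative sketch, not the patch from this PR: the function name, the `lock_ident` parameter, and the psycopg2-style cursor interface are assumptions.

```python
# Sketch of a session-level advisory lock preventing duplicate jobrunners.
# Assumes a psycopg2-style cursor; names are illustrative, not the actual patch.
def try_acquire_jobrunner_lock(cr, lock_ident):
    """Return True if this process now holds the advisory lock and may
    act as the single jobrunner. Session-level locks are released
    automatically by PostgreSQL when the connection closes."""
    cr.execute("SELECT pg_try_advisory_lock(%s)", (lock_ident,))
    return cr.fetchone()[0]
```

The key property is that `pg_try_advisory_lock` does not block: a second jobrunner gets `False` immediately and can back off instead of hanging.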
Right, root channels are global across databases. I overlooked this.
It means we have to keep open a new connection on
Or acquiring the lock on the first database in alphabetical order? Dangerous if one database gets added, though.
Access to the
Totally, the bus wouldn't work without access to postgres. For the record, I thought about using a single connection to
Also, is a single connection to Postgres not a problem for setups with multiple Odoo instances querying the same postgresql instance?
If we make the advisory lock name configurable, that may work.
That additional connection to
I was reaching to the same point. We have a PostgreSQL cluster with several odoo instances, we have one jobrunner per database (
By default, when we start 2 jobrunners, one with
Yes, and you could normalize (i.e. sort) the list of databases.
I pushed a
@guewen Hi, guewen, thanks for your work on this issue. I had a try with this commit on odoo.sh, unfortunately it didn't work in our project. Here is what I did:
I hope this info would help you in this case. If I did it in a wrong way, please let me know too, I would love to try it again, thx
thanks @lilee115, I guess I should finally try to run it on odoo.sh to investigate
@lilee115 it seems the issue you mention was caused by the anonymous session which could not be retrieved. After merging #252 and rebasing my branch on top of it, I could test my branch on odoo.sh, with jobs properly executing. I pushed a new commit for the required configuration on odoo.sh. I couldn't test with more than one odoo worker though because I am on a trial project. I don't see why the lock wouldn't work though.
Hello, I am facing the same problem as @lilee115, but not on odoo.sh. The job runner only runs once and after that jobs remain pending.
Do you have logs using the debug level?
It seems the problem only appears on a given server, not elsewhere, so it is probably a wrong parameter setting.
This would be a great built-in feature for HA support! 2 remarks:
Very good point!
What's your idea behind the custom lock name? Is it to complement #267? If so, wouldn't releasing and taking a new lock with the new list of databases be better? Or could a custom lock name be better for another reason?
I added cr.execute("SET idle_in_transaction_session_timeout = 60000;") before taking the lock. This comment:
Could be part of a follow-up pull request.
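Combining the timeout from the quoted snippet with the lock acquisition could look like this. Only the `SET idle_in_transaction_session_timeout = 60000` statement comes from the comment above; the function name, parameters, and cursor interface are illustrative assumptions.

```python
# Sketch: set a session timeout before taking the advisory lock, so a
# jobrunner stuck in an idle transaction has its session (and therefore
# the lock) reclaimed by PostgreSQL instead of blocking others forever.
# Only the 60000 ms SET statement comes from the PR; the rest is a sketch.
def acquire_lock_with_timeout(cr, lock_ident, timeout_ms=60000):
    # psycopg2 interpolates parameters client-side, so SET accepts this form
    cr.execute("SET idle_in_transaction_session_timeout = %s", (timeout_ms,))
    cr.execute("SELECT pg_try_advisory_lock(%s)", (lock_ident,))
    return cr.fetchone()[0]
```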
Hi @guewen, I did some tests. We need to keep a transaction open if we want to leverage
With these changes, I confirm that these scenarios are supported:
My only concern in the context of odoo.sh is that in the absence of http traffic, all http workers will be stopped and thus all jobrunners. If you have a scheduled action that creates queue jobs, they will stay pending until the next http request (that will trigger the start of an http worker). One workaround is to use an external service (e.g. pingdom or a custom prometheus) to request the instance periodically.
Excellent! Many thanks @nilshamerlinck for the test, fix and details!
Yes, I suppose you are right. Same story for jobs with an ETA in the future.
Can't you use an odoo cron for that?
Hi @yelizariev
A cron could help to keep workers warm indeed, but:
[14.0] Support multi-nodes with lock on jobrunner (port of OCA#256)
Hi guys, is this good to run on a production instance on Odoo.sh?
Sorry to hijack the thread again, but unfortunately this solution will still cause issues on Odoo.sh.
```python
self.initialize_databases()
_logger.info("database connections ready")

last_keep_alive = None

# inner loop does the normal processing
while not self._stop:
    self.process_notifications()
    self.run_jobs()
    self.wait_notification()
```
I'm not 100% sure what the exact problem is that @amigrave has with locking vs. replication, but if it is the fact that the advisory lock is held for a long time, maybe a solution could be to always try to re-acquire the lock at the beginning of this loop, and let it go at the end? Then the lock would not be held so long.
If, by chance, another instance of the job runner is 'first' to acquire the lock, it will just change the master job runner to that one, but there will still only be one running at a time.
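The suggestion above can be sketched roughly as follows. The callback-based structure, the function name, and the `should_stop`/`process_once` parameters are all hypothetical, introduced only to illustrate the acquire-per-pass idea.

```python
# Sketch of the reviewer's suggestion: take the advisory lock around each
# processing pass instead of holding it for the runner's whole lifetime.
# Whichever runner wins the race does that pass; the others skip it.
def runner_loop(cr, lock_ident, should_stop, process_once):
    while not should_stop():
        cr.execute("SELECT pg_try_advisory_lock(%s)", (lock_ident,))
        if cr.fetchone()[0]:
            try:
                # process_notifications / run_jobs / wait_notification
                # would go here in the real runner
                process_once()
            finally:
                cr.execute("SELECT pg_advisory_unlock(%s)", (lock_ident,))
```

The trade-off is more round-trips to PostgreSQL per pass, in exchange for never holding the lock across a long wait.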
* PostgreSQL advisory locks are based on an integer; the list of database names
  is sorted, hashed and converted to an int64, so we lose information in the
  identifier. A low risk of collision is possible. If it happens some day, we
  should add an option for a custom lock identifier.
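The identifier scheme described in that paragraph could look roughly like this. The use of `hashlib.sha256` and the exact byte handling are assumptions for illustration; the merged code may derive the int64 differently.

```python
# Sketch of the lock-identifier derivation described above: sort the
# database names, hash the joined list, keep 8 bytes as a signed int64.
# The hash function and byte handling are illustrative assumptions.
import hashlib
import struct

def advisory_lock_ident(db_names):
    # sorting makes the identifier independent of the order the
    # databases were listed in the configuration
    normalized = ",".join(sorted(db_names))
    digest = hashlib.sha256(normalized.encode("utf-8")).digest()
    # PostgreSQL advisory locks take a signed bigint, so truncating the
    # hash to 8 bytes loses information: collisions are possible but rare
    return struct.unpack("<q", digest[:8])[0]
```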
Could it be a solution to try to lock the channel records using:

    SELECT name FROM queue_job_channel FOR UPDATE;

instead of going for the integer lock?
That could solve both above issues.
Long locks on a table will lead to vacuum issues (and probably replication as well)
* If 2 job runners have a database in common but a different list (e.g.
  ``db_name=project1,project2`` and ``db_name=project2,project3``), both job
  runners will work and listen to ``project2``, which will lead to unexpected
  behavior.
Perhaps hold a lock for each of the database names separately?
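A per-database variant of the lock could look like this sketch: one advisory lock per database name, so two runners with overlapping lists split the databases instead of both listening to the shared one. The function name and the `ident_for` helper (mapping a name to an int64) are hypothetical.

```python
# Sketch of the per-database suggestion above: acquire one advisory lock
# per database name. A runner ends up responsible only for the databases
# whose locks it managed to take; overlapping runners split the set.
# `ident_for` is a hypothetical helper mapping a name to an int64.
def acquire_per_database_locks(cr, db_names, ident_for):
    acquired = []
    for name in sorted(db_names):
        cr.execute("SELECT pg_try_advisory_lock(%s)", (ident_for(name),))
        if cr.fetchone()[0]:
            acquired.append(name)
    return acquired  # the databases this runner is now responsible for
```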
@guewen @thomaspaulb any update on this? Just wondering how far this PR is from being "odoo.sh ready" (if possible at all). Soon we'll have to provide odoo.sh compatibility for some connector modules, which will likely lead us to remove the dependency on
CC @sebalix
And #256 (comment) may be worth trying.
There hasn't been any activity on this pull request in the past 4 months, so it has been marked as stale and it will be closed automatically if no further activity occurs in the next 30 days.
Hi @amigrave, I know this thread is quite old, but there have been some new developments for an HA deployment of queue_job that is also compatible with odoo.sh. This new approach is being discussed here: #607. We are now evaluating this solution vs. session-level advisory locks, but the last thing you mentioned in this PR was that advisory locks caused an issue with the db replication. If you could find a few minutes to give us some feedback there so we can ensure this will also be compatible with odoo.sh, we would greatly appreciate it.
Starting several odoo (main) processes with "--load=web,queue_job"
was unsupported, as it would start several jobrunners, which would all
listen to postgresql notifications and try to enqueue jobs in concurrent
workers.
This is an issue in several cases:

* and starts several jobrunners (How to set up queue_job in odoo sh? #169 (comment))
* available in case of failure of a node/host
The solution implemented here uses a PostgreSQL advisory lock, taken
at session level in a connection on the "postgres" database, which
ensures that 2 job runners are not working on the same set of databases.
At loading, the job runner tries to acquire the lock. If it can, it
initializes the connection and listens for jobs. If the lock is taken
by another job runner, it waits and retries to acquire it every 30
seconds.
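The acquire-or-retry startup described above can be sketched like this. The function name, the callback, and the `max_tries` escape hatch are hypothetical; only the 30-second delay comes from the text.

```python
# Sketch of the startup behaviour described above: try to take the lock,
# and if another jobrunner holds it, retry every 30 seconds.
# `try_acquire` is a hypothetical callable returning True once the lock
# is obtained; `max_tries` is an illustrative escape hatch for testing.
import time

def wait_for_lock(try_acquire, retry_delay=30.0, max_tries=None):
    tries = 0
    while True:
        if try_acquire():
            return True  # this process is now the active jobrunner
        tries += 1
        if max_tries is not None and tries >= max_tries:
            return False
        time.sleep(retry_delay)
```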
Example when a job runner is started and another one starts:
The shared lock identifier is computed based on the set of databases
the job runner has to listen to: if a job runner is started with
``--database=queue1`` and another with ``--database=queue2``, they will
have different locks and thus will be able to work in parallel.
Important: new databases need a restart of the job runner. This was
already the case; removing that limitation would be a great improvement,
but it is out of scope for this change.