Trying to add 2 nodes simultaneously to closure_tree table results in a deadlock #240
Comments
I have just discovered a similar problem in production on PostgreSQL. I have not been able to track down what is causing it, but I discovered an infinite loop of
in my log. The database is unresponsive until I restart the app. closure_tree version: 6.0.0 |
Yikes. Thanks for reporting. Can you see if reverting #234 fixes the issue? That's the most recent update to locks. |
Reverting that commit doesn't seem to fix this. |
Same here. I had to revert back to version 6.0 in production, but the problem still appears, though much less frequently with version 6.0 than with 6.2. I'm trying to find time to write tests that reproduce this problem locally. |
Can confirm, this issue has been around for a while. 6.2 seems to have increased the frequency from a monthly to hourly occurrence. |
Are you all using Postgres? What version? |
FWIW, I've been looking through the diffs between 6.0.0 and 6.0.2. Nothing immediately nefarious jumps out. v6.0.0...master |
We're using: PostgreSQL 9.4.5 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 4.9.1-16ubuntu6) 4.9.1, 64-bit |
I am killing the lock by hand. |
Thanks for the additional info. @DuaneCompton did you have 1 lockup/10k inserts with v6.1? |
@mceachen There are 6 threads running simultaneously under load. Under v6.1 I am averaging 1 lockup/10k inserts. In v6.2 it locks basically immediately. |
We use MySQL, so I don't think this is a Postgres-specific issue. |
@mceachen Anything I can do from a bug-gathering standpoint to help troubleshoot this issue? We still run into a lockup on a weekly basis on 6.0. I am probably going to write a little script to free the lock if the table stays locked for longer than N minutes. But before I do so, is there any way I can help? |
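(For anyone else wanting such a watchdog: a minimal sketch, Postgres only, of killing a session that has been holding an advisory lock for too long. The 60-second threshold and the advisory-lock assumption are mine, not from this thread.)

```ruby
# Terminate any backend that has held an advisory lock while running a
# statement for more than 60 seconds. Run from a cron job or a rake task.
rows = ActiveRecord::Base.connection.exec_query(<<~SQL)
  SELECT DISTINCT a.pid
  FROM pg_locks l
  JOIN pg_stat_activity a ON a.pid = l.pid
  WHERE l.locktype = 'advisory'
    AND a.state <> 'idle'
    AND now() - a.query_start > interval '60 seconds'
SQL
rows.each do |row|
  ActiveRecord::Base.connection.execute("SELECT pg_terminate_backend(#{row['pid'].to_i})")
end
```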
Can you capture the current running queries? Without a test to replicate
the issue on demand, any change we make will be a random shot in the dark.
|
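(In case it helps anyone gather that data, the currently running queries can be dumped with the database's standard introspection views; a small sketch using ActiveRecord's connection directly:)

```ruby
# Postgres: non-idle backends and the statement each is executing.
puts ActiveRecord::Base.connection.exec_query(
  "SELECT pid, state, query FROM pg_stat_activity WHERE state <> 'idle'"
).to_a
# MySQL equivalent:
# puts ActiveRecord::Base.connection.exec_query('SHOW FULL PROCESSLIST').to_a
```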
@mceachen will do. |
Has anyone had any more luck in tracking down this issue? I've tried to provoke it by creating multiple records in parallel in separate threads, as was mentioned initially by @amitsaxena, but no luck so far. I've forked off 6 threads, each creating 1000 records. Tried on both 6.2 and 6.0. |
@tomaadland are you unable to reproduce the issue? I was consistently getting the bug as soon as I executed the code above. |
@amitsaxena as soon as I have some time on my hands I'll create a public project with an example model, similar to one I have in production, along with the tests. I'll post it here to let you know. |
@mceachen So, I've noticed that when we hit this state there is really just one lock that is hanging around. Is there any reason you can see that we couldn't use the timeout in your with_advisory_lock gem to at least avoid an infinite lock? |
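(For reference, a minimal sketch of what using that timeout might look like, assuming the gem's timeout_seconds argument and its documented behavior of returning false when the lock isn't obtained; the lock name is a placeholder:)

```ruby
# Give up after 10 seconds instead of waiting forever; the block is skipped
# and false is returned if the lock can't be acquired in time.
acquired = Tag.with_advisory_lock('closure_tree_hierarchy', 10) do
  # ... hierarchy maintenance ...
  true
end
raise 'could not acquire hierarchy lock' unless acquired
```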
I was also curious about this line: https://github.com/mceachen/closure_tree/blob/master/lib/closure_tree/hierarchy_maintenance.rb#L82
`with_advisory_lock` is smart enough to no-op if a currently-acquired lock is requested again: See https://github.com/mceachen/with_advisory_lock/blob/master/lib/with_advisory_lock/base.rb#L42
If we auto-expire the rebuild lock, the rebuild isn't guaranteed to be consistent.
The deadlock is either due to a bug in the RDBMS, or, much more likely, a
circular dependency due to transaction boundaries.
For example, if one of your data mutation code paths gets lock A then lock
B, and another gets B then A, you'll regularly see deadlocks.
(edited to add link to nested lock no-op)
|
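(To make the circular-dependency case concrete, here is an illustration only -- a hypothetical Account model with plain ActiveRecord row locks, not closure_tree internals -- of the A-then-B vs. B-then-A pattern described above:)

```ruby
# Writer 1: locks row A, then row B, inside one transaction.
Account.transaction do
  a = Account.lock.find(1) # row lock on A
  b = Account.lock.find(2) # blocks if someone else holds B
  # ... mutate a and b ...
end

# Writer 2 (elsewhere): the same rows in the opposite order.
Account.transaction do
  b = Account.lock.find(2) # row lock on B
  a = Account.lock.find(1) # waits for Writer 1, who is waiting for B
  # ... mutate a and b ...
end
# If both reach their second lock at the same time, each waits on the other;
# the database normally detects this cycle and aborts one of the transactions.
```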
So, it's not a deadlock per se. If I move the lock obtain/release around to cover just the critical sections of the write/delete/children call, then it will be detected as a deadlock. What is actually happening is: |
Can you maybe point to the code, or talk more about what's going on in writer 1 and 2? Deadlocks I've seen in the past occur when there are multiple code paths that have transaction blocks that edit models in different order. |
Sorry, it's not a deadlock in the traditional sense; it's actually worse, since a deadlock would be detected and aborted. In this case, the first process has obtained the lock and is proceeding on its way when the second process comes in and attempts to obtain a lock. It fails since writer 1 already has the lock (which is the correct behavior IMO). What I am having trouble understanding is why writer 1 hangs until writer 2 gives up attempting to obtain its lock. If I run something like:

```ruby
parent = Tag.create(name: 'lock 1')
(1..100).each do |i|
  p "adding #{i}"
  new_node = Tag.create(:name => "test deadlock #{i}")
  ActiveRecord::Base.transaction do
    parent = parent.add_child(new_node)
  end
end
```

on two separate machines, then upon starting the second, both will shortly arrive in a semi-deadlocked situation. The script that I started the call on will eventually expire and the first writer will continue. I've tried moving the transaction around and running without it. Below is one potential outcome.

Writer 1:

Writer 2
|
And by 👍 I mean 😭. I'll see if I can wire up a test to do this. |
Thank you. I will see what I can do as well. I do not believe at this time that it's an A-then-B / B-then-A situation, because I think the lock acquisition time is much higher than the deadlock check timer on the db. I'm wondering if taking a complete table lock in the rebuild! function might be better than taking an additional lock for each delete/child insert (maybe I am misunderstanding how those locks work, though). |
I think a table lock (which I had dismissed as being too coarse a mutex) is an interesting alternative. I'll code that up as an option. |
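(A sketch of what that coarse alternative might look like, in Postgres syntax; the `tag_hierarchies` table name and the lock mode are assumptions, not closure_tree's actual implementation:)

```ruby
Tag.transaction do
  # ACCESS EXCLUSIVE blocks all other readers and writers on the hierarchy
  # table until this transaction commits, so concurrent rebuilds serialize.
  Tag.connection.execute('LOCK TABLE tag_hierarchies IN ACCESS EXCLUSIVE MODE')
  # ... rebuild the hierarchy rows here ...
end
```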
I have almost the same problem (after upgrading to Rails 5); with Rails 4.2.5 it worked well. I have
And after saving the parent item I got a deadlock. Tried versions
|
I think it's more about the Rails update than mysql Ver 14.14 Distrib 5.5.44. I am playing with it, |
@mceachen It appears the issue I am running into (and I am guessing @amitsaxena's as well) is specifically around the numeric reordering (https://github.com/mceachen/closure_tree/blob/master/lib/closure_tree/hierarchy_maintenance.rb#L76) while writer 2 attempts to write into https://github.com/mceachen/closure_tree/blob/master/lib/closure_tree/numeric_order_support.rb#L39. If you call children.create() directly it will avoid this issue, though I would guess any concurrent movement of leaves with numeric ordering enabled could result in this. If I replace the with_advisory_lock in support.rb with just a model_class.lock(true), the deadlock appears to go away. I did also note that if you attempt a destroy_all on a large tag set > 10000 (with or without numeric ordering)
Hope this is helpful. |
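(For context, ActiveRecord's `lock(true)` issues a SELECT ... FOR UPDATE; a minimal sketch of that pattern -- not the actual change to numeric_order_support.rb, and `parent_id` here is a placeholder -- might look like:)

```ruby
Tag.transaction do
  # SELECT ... FOR UPDATE on the parent row: concurrent writers serialize on a
  # row lock the database can reason about, instead of on an advisory lock.
  parent = Tag.lock(true).find(parent_id)
  parent.add_child(Tag.new(name: 'new child'))
end
```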
@DuaneCompton that is super helpful, and shouldn't be hard to write a breaking test. Do you want to have a go? |
I will give it a shot. Thanks. |
@DuaneCompton any progress here? I just ran into this issue. |
Thanks to @serenaf, who highlighted the MySQL docs revealing that v5.7.5 fundamentally changed how |
@uhlenbrock this issue (which seems to be |
@mceachen I have started to take a look. I have two PRs open on |
@uhlenbrock Any ideas on how to replicate this in a test? |
@tomaadland I have started work over here: ClosureTree/with_advisory_lock#17 |
I am working to build a failing test based on the suggestion above:

```ruby
describe Tag do
  context 'in a race condition' do
    it 'should not break the database' do
      parent1 = Tag.create!(name: 'parent tag 1')
      parent2 = Tag.create!(name: 'parent tag 2')
      Thread.new do
        new_node = Tag.new(name: 'test deadlock 1')
        parent1.add_child(new_node)
      end
      Thread.new do
        new_node = Tag.new(name: 'test deadlock 1')
        parent2.add_child(new_node)
      end
    end
  end
end
```

Via
I believe this is because of
Does this appear to adequately reproduce the issue? |
I tried running this test in the gem and I get the
When I created a similar unit test in the app that is using this gem, and when calling
So in my case the
causes a deadlock. |
I may have found a workaround that works for me -- though at present I'm not sure of what other consequences may arise from my solution. I'd long since upgraded the database to MariaDB 10.x from some older MySQL without apparent incident. Until one day when I needed to reparent something, and then...
At the time, I wrote an awful script that monkey-patched the part of
Now, however, I'm going to need to reparent more often, so I took some time today to do a little more research with a simple test script...

```ruby
#!/usr/bin/ruby
$LOAD_PATH.unshift '/idw/lib'
Dir.chdir('/idw') do
  require 'rubygems'
  require 'bundler'
  Bundler.require(:default) # TODO: split this out to just what's needed
  require '/idw/config/environment'
end
require 'pp'

# connect to the right database for the current environment
IDW::Base.establish_connection(YAML.load(File.read('/idw/config/database.yml'))[ENV['IDW_ENV']])

IDW::Base.connection_pool.with_connection do
  hc = IDW::Hostclass.find_by(name: 'NS_OPMS')
  pp hc.self_and_ancestor_ids
  phc = IDW::Hostclass.find_by(name: 'NS_ORCAWAVE')
  pp phc.self_and_ancestor_ids
  phc.children << hc
  pp hc.self_and_ancestor_ids
end
```

Which, when run... Failed:
...but provided some clues as to what was going on.
The (currently) third solution to https://stackoverflow.com/questions/6000336/how-to-debug-lock-wait-timeout-exceeded-on-mysql mentions that InnoDB's default isolation level is REPEATABLE_READ, whereas other DBs use READ_COMMITTED. Setting my database's default isolation level to READ_COMMITTED...

```sql
SET GLOBAL TRANSACTION ISOLATION LEVEL READ COMMITTED;
```

...did the trick and allowed the reparenting script (and another more complicated script) to work as anticipated. I'm not sure I'd want to leave this on globally; it might be best to set it as tightly around the code as possible (before that first transaction). Maybe something like...

```ruby
FooModel.transaction(isolation: :read_committed) do
  # the reparenting stuff
end
```

I'm not confident this solves the other two-thread deadlock issue, as that appears to be happening with non-MySQL/MariaDB databases, also. |
I hadn't thought about this before @tomalok's post, but I believe |
We are facing the same issue. Any update on this? |
I can reliably reproduce this deadlock with the following test case, which concurrently removes children from a nested set with numeric sort order. The ordering seems to be the culprit: it tries to update the siblings' sort indices, which takes an implicit row lock; that, in conjunction with the implicit row locks taken out when Rails updates the parent_id on records (which happens before the advisory lock is attempted), can produce a circular wait. So a possible scenario would be:
This kind of deadlock, where one writer waits for the advisory lock while the other waits for a row-level lock, apparently isn't detected by either MySQL or Postgres, leading to an endless wait. The only way I found to fix this is to get the advisory lock before even updating the parent_id, i.e. by calling

Test case:

```ruby
it "fails to deadlock from removing siblings with sort_order" do
  skip("unsupported") unless run_parallel_tests?
  @root = Label.create!(name: 'root')
  60.times { |i| Label.create! name: i, mother_id: @root.id }
  @root.reload

  # remove 40 random children in parallel threads
  descendant_ids = @root.descendant_ids
  ids_to_remove = descendant_ids.sample(40).shuffle
  ids_to_keep = descendant_ids - ids_to_remove
  Parallel.map(ids_to_remove, in_threads: max_threads) do |id|
    ActiveRecord::Base.connection_pool.with_connection do
      Label.find(id).update(mother_id: nil)
    end
  end

  Label.where(id: ids_to_remove).each do |l|
    assert_nil l.mother_id
    assert l.root?
    assert l.leaf?
  end
  @root.reload
  assert_equal 20, @root.children.count
  assert order = @root.children.pluck(:column_whereby_ordering_is_inferred).uniq
  assert_equal (0..19).to_a, order
end
```

The test will get stuck in an endless loop, unless the call to |
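(For illustration, a sketch of the kind of fix described above -- acquiring the advisory lock before the parent_id update -- with the caveat that the lock name and the call site here are assumptions, not the gem's actual change:)

```ruby
# Sketch only: take an advisory lock before issuing the UPDATE that clears
# mother_id, so the advisory lock and the implicit row locks are always
# acquired in the same order. The lock name here is arbitrary; in practice it
# would have to match the one closure_tree's hierarchy maintenance uses.
Label.with_advisory_lock('label_hierarchy_maintenance') do
  Label.find(id).update(mother_id: nil)
end
```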
When two nodes are added simultaneously to the closure tree table, this results in a deadlock and the query to acquire the lock runs infinitely:
You can reproduce the problem easily by using the code below (where List is the closure_tree model):
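(The original snippet was attached to the issue; as a hypothetical stand-in only -- not the attached code -- a concurrent insert loop along these lines exercises the same path:)

```ruby
# Hypothetical reproduction sketch: two threads adding children to the same
# List parent at the same time.
parent = List.create!(name: 'parent')
threads = 2.times.map do |i|
  Thread.new do
    ActiveRecord::Base.connection_pool.with_connection do
      parent.add_child(List.create!(name: "child #{i}"))
    end
  end
end
threads.each(&:join)
```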
We faced this problem in production, and the infinite locking loop results in MySQL being unresponsive, which in turn results in an unresponsive app.
closure_tree version: 6.2.0
ruby: 2.3.0
rails: 5.0.0.1