Description
A couple of times now I've spoken to someone who has a problem like this:
require 'concurrent'

pool = Concurrent::FixedThreadPool.new(Concurrent.processor_count)
1.upto 2000000 do |i|
  pool.post do
    # work here
  end
end
or
File.open('foo').each do |line|
  pool.post do
    # work here
  end
end
In both cases we're creating a huge number of tasks, and in the latter case we may not know in advance how many there will be.
Each task holds a proc and its closure, so millions of queued tasks can take up a lot of memory.
I feel like we're missing two abstractions here.
The first is a basic parallel #each on an Enumerable with a length. We don't have that, do we? It would need chunking (run n iterations in each task), and perhaps automatic chunking based on profiling (run one iteration, see how long it takes, consider how many iterations there are in total, and set n based on that).
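As a rough illustration of the chunking idea, here is a minimal sketch using only the Ruby standard library rather than concurrent-ruby. The name parallel_each_chunked, its keyword arguments, and the "four chunks per thread" default are all made up for this example; it is not an existing or proposed API, and it skips the profiling step entirely.

```ruby
# Hypothetical helper (not part of concurrent-ruby): runs the block once per
# element of a sized collection, but creates only one unit of work per chunk
# instead of one per element.
def parallel_each_chunked(collection, threads: 4, chunk_size: nil, &block)
  # Assumed default: about four chunks per thread, so one slow chunk does not
  # leave the other threads idle. The factor 4 is arbitrary.
  chunk_size ||= [(collection.size / (threads * 4.0)).ceil, 1].max

  # Note: this still materialises every element into chunk arrays, which is
  # cheaper than one proc per element but not free.
  chunks = Queue.new
  collection.each_slice(chunk_size) { |chunk| chunks << chunk }
  chunks.close # pop returns nil once the queue is closed and drained

  workers = Array.new(threads) do
    Thread.new do
      while (chunk = chunks.pop)
        chunk.each { |item| block.call(item) }
      end
    end
  end
  workers.each(&:join)
  nil
end

# Usage: collect results in a thread-safe Queue.
results = Queue.new
parallel_each_chunked(1..10_000, threads: 4) { |i| results << i * 2 }
```

With these defaults the 10,000 elements are split into 16 chunks of 625, so 16 units of work are created rather than 10,000.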
The second is something similar that works on an Enumerator, which doesn't have a length. Here chunking is harder as we don't know how many tasks there will be in advance. We may need some kind of work stealing here.
I'd like to write the two examples above as:
1.upto(2000000).parallel_each(pool) do
  # work here
end
or
File.open('foo').each.parallel_each(pool) do |line|
  # work here
end
In both cases we'd only create as many tasks at a time as is reasonable (if the pool has n threads, that might be kn tasks for some small constant k).
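One way to get the "at most kn at a time" behaviour for a source of unknown length is a bounded queue between the producer and the workers. The sketch below uses only stdlib SizedQueue and Thread; bounded_parallel_each and its threads and k parameters are hypothetical names for this example, and it uses simple backpressure rather than work stealing.

```ruby
require 'stringio'

# Hypothetical helper (not part of concurrent-ruby): feeds items from any
# enumerable to a fixed number of worker threads through a bounded queue, so
# at most k * threads items are waiting at any time.
def bounded_parallel_each(enum, threads: 4, k: 2, &block)
  queue = SizedQueue.new(k * threads)

  workers = Array.new(threads) do
    Thread.new do
      # Items are wrapped in a one-element array so that nil or false items
      # are not mistaken for the "queue closed" signal (pop returning nil).
      while (slot = queue.pop)
        block.call(slot.first)
      end
    end
  end

  # Pushing blocks while the queue is full, so the producer never gets far
  # ahead of the workers. Error handling is omitted in this sketch.
  enum.each { |item| queue << [item] }
  queue.close
  workers.each(&:join)
  nil
end

# Usage: a StringIO stands in for File.open('foo') to keep this runnable.
input = StringIO.new((1..500).map { |n| "line #{n}\n" }.join)
seen = Queue.new
bounded_parallel_each(input.each_line, threads: 4, k: 2) { |line| seen << line }
```

The length of the input is never consulted, so this works the same way for a file, a socket, or any other Enumerator.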
A workaround in the meantime, just to stop so many tasks being created and memory being blown, may be to do this (ping @digininja, this is relevant to you):
pool = Concurrent::FixedThreadPool.new(
  Concurrent.processor_count,
  max_queue: 10 * Concurrent.processor_count,
  fallback_policy: :caller_runs)
1.upto 2000000 do |i|
  pool.post do
    # work here
  end
end
This will only queue up to 10 times as many tasks as you have cores; any further task is run immediately by the calling thread instead of being added to the pool. For example, on a 4-core machine, if 40 tasks are already queued, the next task is run on the main thread rather than queued. By the time the main thread finishes it and the loop moves on, the pool may have room for new tasks; if not, the main thread runs the next task as well.
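To make the caller-runs behaviour concrete, here is a toy stdlib-only model of that policy. TinyCallerRunsPool is an invented class for illustration and is not how concurrent-ruby implements its executor; it also assumes a single thread calling post.

```ruby
# Toy model of a bounded pool with a caller-runs fallback (illustration only).
class TinyCallerRunsPool
  def initialize(threads:, max_queue:)
    @max_queue = max_queue
    @queue = Queue.new
    @workers = Array.new(threads) do
      Thread.new do
        while (task = @queue.pop) # nil once the queue is closed and empty
          task.call
        end
      end
    end
  end

  # Queues the task if there is room; otherwise runs it on the calling thread.
  # The size check is only safe with a single posting thread, which is an
  # assumption of this sketch.
  def post(&task)
    if @queue.size < @max_queue
      @queue << task
    else
      task.call
    end
  end

  def shutdown
    @queue.close
    @workers.each(&:join)
  end
end

# Usage: count how many tasks end up running on the main thread.
main = Thread.current
ran_on_caller = 0
completed = Queue.new
pool = TinyCallerRunsPool.new(threads: 2, max_queue: 4)
1.upto(200) do |i|
  pool.post do
    ran_on_caller += 1 if Thread.current == main
    sleep 0.001 # stand-in for real work
    completed << i
  end
end
pool.shutdown
```

Because the main thread is busy running a task whenever the queue is full, it cannot keep creating new ones, which is what keeps the number of pending tasks bounded.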