
No good tools for long streams of tasks #493

Open
@chrisseaton

Description

A couple of times now I've been speaking to someone and they have a problem like this.

pool = Concurrent::FixedThreadPool.new(Concurrent.processor_count)

1.upto 2000000 do |i|
  pool.post do
    # work here
  end
end

or

File.open('foo').each do |line|
  pool.post do
    # work here
  end
end

In both cases we're creating a huge number of tasks, and in the latter case we may not know how many tasks there will be in advance.

The problem in both cases is that millions of tasks get created, and each one carries a proc and its closure, so the queue can take up a lot of memory.

I feel like we're missing two abstractions here.

The first is a basic parallel #each on an Enumerable with a length. We don't have that do we? It would need chunking (run n items of work in each task), and perhaps automatic chunking based on profiling (run 1 item, see how long it takes, think about how many items there are and set n based on that).

The second is something similar that works on an Enumerator, which doesn't have a length. Here chunking is harder as we don't know how many tasks there will be in advance. We may need some kind of work stealing here.
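As a stdlib-only sketch of that second abstraction (the method name `parallel_each` here is hypothetical, not part of concurrent-ruby, and plain threads stand in for the pool): a bounded queue lets a fixed set of workers drain an enumerable of unknown length, with the producer blocking instead of buffering millions of tasks:

```ruby
require 'etc'

# Hypothetical sketch: drain an enumerable of unknown length with a fixed
# number of workers, bounding in-flight items so memory stays flat.
def parallel_each(enum, workers: Etc.nprocessors, bound: nil)
  bound ||= 10 * workers
  queue = SizedQueue.new(bound)          # push blocks once `bound` items are queued
  threads = Array.new(workers) do
    Thread.new do
      while (item = queue.pop) != :done  # :done is a per-worker stop sentinel
        yield item
      end
    end
  end
  enum.each { |item| queue.push(item) }  # producer blocks instead of buffering everything
  workers.times { queue.push(:done) }
  threads.each(&:join)
end

hits = Queue.new
parallel_each(1.upto(1_000_000), workers: 4) { |i| hits << i if (i % 250_000).zero? }
hits.size  # => 4
```

The SizedQueue is what keeps memory bounded: at most `bound` items are ever buffered, no matter how long the input stream is.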

I'd like to write the two examples above as:

1.upto(2000000).parallel_each(pool) do
  # work here
end

or

File.open('foo').each.parallel_each(pool) do |line|
  # work here
end

In both cases we'd only create as many tasks at a time as was reasonable (if the pool has n threads it may be kn tasks for some small constant k).
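The chunking idea for the known-length case can be sketched with each_slice (again a hypothetical method name, plain threads standing in for the pool): 2,000,000 elements with a chunk size of 1,000 becomes only 2,000 tasks:

```ruby
# Hypothetical sketch: one task per chunk of `chunk_size` elements
# rather than one task per element.
def parallel_each_chunked(enum, workers: 4, chunk_size: 1000)
  chunks = Queue.new
  enum.each_slice(chunk_size) { |chunk| chunks << chunk }
  workers.times { chunks << nil }  # nil sentinel stops each worker
  Array.new(workers) do
    Thread.new do
      while (chunk = chunks.pop)
        chunk.each { |item| yield item }
      end
    end
  end.each(&:join)
end

sum = 0
mutex = Mutex.new
parallel_each_chunked(1..1000, workers: 2, chunk_size: 100) do |i|
  mutex.synchronize { sum += i }
end
sum  # => 500500
```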

A workaround in the meantime, just to stop so many tasks being created and memory being blown, may be to do this (ping @digininja this is relevant to you):

pool = Concurrent::FixedThreadPool.new(
  Concurrent.processor_count,
  max_queue: 10 * Concurrent.processor_count,
  fallback_policy: :caller_runs)

1.upto 2000000 do |i|
  pool.post do
    # work here
  end
end

This will only queue up to 10 times as many tasks as you have cores, with any further tasks being run immediately on the calling thread instead of being added to the pool. So if the queue is already full, the main thread runs the new task itself; by the time it finishes, the pool may have drained enough to accept new tasks, or it may not, in which case the main thread runs the next task as well.

Labels

enhancement: Adding features, adding tests, improving documentation.
looking-for-contributor: We are looking for a contributor to help with this issue.
medium-priority: Should be done soon.
