Conversation

@ajeans ajeans commented Oct 11, 2025

Replace old format Map<Long, Long> with new format Map<Long, Set<Long>> to support configurable concurrent jobs per project.

A one-time Redis migration runs automatically on startup and converts existing lock data to the new format.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Configure how many batch jobs can run in parallel per project.
    • Project locks can track multiple concurrent jobs and return detailed info for each.
    • Lock status now treats an empty set as unlocked and existing locks are migrated automatically.
  • Bug Fixes

    • A warning is now logged when job info is unavailable for a locked job.
  • Tests

    • Tests converted to parameterized concurrency checks and updated to verify collection-based lock state.

coderabbitai bot commented Oct 11, 2025

Walkthrough

Project batch locking changed from a single-job lock per project to a set of job IDs with configurable per-project concurrency. Public API models, locking manager, startup migration, and tests were updated to use Set<Long> for locks, fetch multiple JobInfo entries, and migrate legacy Redis entries.
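
For orientation, here is a minimal, illustrative sketch of the set-based admission logic described above, using a plain ConcurrentHashMap. The real manager additionally handles Redis, seeds initial state from the database, and logs; names such as projectConcurrency mirror the PR, but this snippet is not the actual implementation.

import java.util.concurrent.ConcurrentHashMap

// Illustrative sketch only: per-project lock sets with a configurable concurrency limit.
class ProjectLockSketch(private val projectConcurrency: Int) {
  private val locks = ConcurrentHashMap<Long, Set<Long>>()

  /** Tries to add jobId to the project's lock set; returns true when the job is admitted. */
  fun tryLock(projectId: Long, jobId: Long): Boolean {
    val updated = locks.compute(projectId) { _, current ->
      val jobs = current ?: emptySet()
      when {
        jobId in jobs -> jobs                          // already holds a slot
        jobs.size < projectConcurrency -> jobs + jobId // capacity left: admit
        else -> jobs                                   // at capacity: reject
      }
    }
    return updated?.contains(jobId) == true
  }

  /** Removes jobId from the project's set, dropping the entry once it becomes empty. */
  fun unlock(projectId: Long, jobId: Long) {
    locks.compute(projectId) { _, current ->
      val remaining = (current ?: emptySet()) - jobId
      if (remaining.isEmpty()) null else remaining     // returning null removes the key
    }
  }

  fun lockedJobs(projectId: Long): Set<Long> = locks[projectId] ?: emptySet()
}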

Changes

API: Project lock model and status
  • backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/administration/ProjectBatchLockController.kt
  ProjectLockModel now exposes lockedJobIds: Set<Long> and jobInfos: List<JobInfo>; lock status derived from set emptiness; fetches JobInfo for all locked IDs and logs warnings for missing jobs.

Locking core & migration
  • backend/data/src/main/kotlin/io/tolgee/batch/BatchJobProjectLockingManager.kt
  Lock map changed to ConcurrentMap<Long, Set<Long>>; methods updated to work with sets (getMap, getLockedForProject, getLockedJobIds, unlockJobForProject); added InitializingBean.afterPropertiesSet migration from legacy Long? entries; added REDIS_PROJECT_BATCH_JOB_LOCKS_KEY and logging updates; constructor now accepts BatchProperties.

Config: concurrency property
  • backend/data/src/main/kotlin/io/tolgee/configuration/tolgee/BatchProperties.kt
  Added projectConcurrency: Int property controlling parallel jobs per project across instances.

Tests & utilities: adapt and parameterize
  • backend/app/src/test/kotlin/io/tolgee/batch/BatchJobTestUtil.kt
  • backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/batch/BatchJobManagementControllerCancellationTest.kt
  • backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/batch/BatchJobManagementControllerTest.kt
  • backend/app/src/test/kotlin/io/tolgee/util/BatchDumper.kt
  Tests updated to expect Set<Long> (use .isEmpty()), added injections for BatchProperties and BatchJobProjectLockingManager, parameterized tests over projectConcurrency (values [1, 2]), added a cancellation test for concurrency=2, and added finallyDumpAll/getAllJobs utilities.

Docs/comment
  • backend/data/src/main/kotlin/io/tolgee/batch/BatchJobConcurrentLauncher.kt
  Comment updated to describe configurable per-project concurrency; no functional change.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Client
  participant Controller as ProjectBatchLockController
  participant Manager as BatchJobProjectLockingManager
  participant Store as Redis/Local Map

  rect rgba(230,245,255,0.6)
  note over Manager: Startup migration (afterPropertiesSet)
  Manager->>Store: Read legacy Long? entries
  alt legacy format present
    Manager->>Store: Write converted Set<Long> entries
  else
    note over Manager: Already set-based
  end
  end

  Client->>Controller: GET /locks
  Controller->>Manager: getMap()/getLockedForProject()
  Manager->>Store: Read Set<Long> per project
  Store-->>Manager: Set<Long>
  Manager-->>Controller: Set<Long> per project
  Controller->>Manager: fetch JobInfo for each jobId
  Manager-->>Controller: List<JobInfo> (filtered)
  Controller-->>Client: ProjectLockModel{lockedJobIds, jobInfos, status}

  rect rgba(235,255,235,0.6)
  note over Client,Manager: Lock attempt
  Client->>Manager: lock(projectId, jobId)
  Manager->>Store: Atomic compute: read Set<Long>
  alt size < projectConcurrency
    Store-->>Manager: Updated Set + jobId
    Manager-->>Client: Locked
  else
    Manager-->>Client: Rejected (at capacity)
  end
  end

  rect rgba(255,240,240,0.6)
  note over Client,Manager: Unlock
  Client->>Manager: unlock(projectId, jobId)
  Manager->>Store: Atomic compute: remove jobId from Set<Long>
  Store-->>Manager: Updated Set<Long>
  Manager-->>Client: Unlocked/No-op
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • fix: batch job processing #3265 — touches BatchJobProjectLockingManager and initial job handling; likely overlaps with the set-based lock conversion and migration.

Suggested labels

enhancement

Suggested reviewers

  • JanCizmar

Poem

I hop through locks now counted by twos,
From single hops to a flock of queued views.
Carrots for concurrency, neat rows to tend,
Old burrows migrated, new paths to mend.
A rabbit cheers each job that can run—hooray! 🥕🐇

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, which is insufficient; the required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped - CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): The PR title "feat: implement project locks containing multiple job ids" directly and accurately reflects the core functionality being introduced in this changeset. The primary change across all modified files is the replacement of single job ID locks (lockedJobId: Long?) with multiple job ID locks (lockedJobIds: Set<Long>), along with supporting infrastructure like the new projectConcurrency configuration property. The title is concise, specific, and clearly communicates the main objective without vague terms or unnecessary detail. A teammate scanning the pull request history would immediately understand that this PR introduces multi-job locking capability at the project level.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
backend/app/src/test/kotlin/io/tolgee/batch/BatchJobTestUtil.kt (2)

307-309: Fix assertJobUnlocked to set semantics.

Still asserts 0L; should assert empty set.

   fun assertJobUnlocked() {
-    batchJobProjectLockingManager.getLockedForProject(testData.projectBuilder.self.id).assert.isEqualTo(0L)
+    batchJobProjectLockingManager.getLockedForProject(testData.projectBuilder.self.id).assert.isEmpty()
   }

385-387: Fix verifyJobLocked to set semantics.

Assert that the set contains the job ID (not equals).

   fun verifyJobLocked(job: BatchJob) {
-    batchJobProjectLockingManager.getLockedForProject(testData.projectBuilder.self.id).assert.isEqualTo(job.id)
+    batchJobProjectLockingManager.getLockedForProject(testData.projectBuilder.self.id).assert.contains(job.id)
   }
🧹 Nitpick comments (8)
backend/data/src/main/kotlin/io/tolgee/batch/BatchJobConcurrentLauncher.kt (1)

172-181: Update log message to reflect per-project concurrency.

Message still says “Other job ... running”. With n>1, prefer “Project concurrency limit reached; re-queuing”.

-      logger.debug(
-        "⚠️ Cannot run execution ${executionItem.chunkExecutionId}. " +
-          "Other job from the project is currently running, skipping",
-      )
+      logger.debug(
+        "⚠️ Cannot run execution ${executionItem.chunkExecutionId}. " +
+          "Project concurrency limit reached; re-queuing",
+      )
backend/data/src/main/kotlin/io/tolgee/configuration/tolgee/BatchProperties.kt (1)

15-16: Validate projectConcurrency ≥ 1.

Prevent 0/negative values that would deadlock locking logic.

 package io.tolgee.configuration.tolgee
 
 import io.tolgee.configuration.annotations.DocProperty
 import org.springframework.boot.context.properties.ConfigurationProperties
+import org.springframework.validation.annotation.Validated
+import jakarta.validation.constraints.Min
 
 @ConfigurationProperties(prefix = "tolgee.batch")
 @DocProperty(description = "Configuration of batch operations.", displayName = "Batch operations")
-class BatchProperties {
+@Validated
+class BatchProperties {
   @DocProperty(description = "How many parallel jobs can be run at once on single Tolgee instance")
   var concurrency: Int = 1
 
   @DocProperty(description = "How many job chunks are added to the internal queue on each scheduled run")
   var chunkQueuePopulationSize: Int = 1_000
 
   @DocProperty(description = "How many parallel jobs can be run at once per project across all Tolgee instances")
-  var projectConcurrency: Int = 1
+  @field:Min(1)
+  var projectConcurrency: Int = 1
 }
backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/administration/ProjectBatchLockController.kt (1)

49-77: Confirm whether unlocked projects should be returned.

If the lock map retains empty sets, this endpoint will list projects with UNLOCKED status. If not desired, filter out empty sets before mapping.

-    val lockModels = locks.entries.map { (projectId, lockedJobIds) ->
+    val lockModels = locks.entries
+      .filter { it.value.isNotEmpty() }
+      .map { (projectId, lockedJobIds) ->
       val lockStatus = when {
         lockedJobIds.isEmpty() -> LockStatus.UNLOCKED
         else -> LockStatus.LOCKED
       }
       ...
     }
backend/data/src/main/kotlin/io/tolgee/batch/BatchJobProjectLockingManager.kt (5)

18-21: Update class KDoc to reflect multi-job concurrency.

Doc still says “Only single job…”. Update to avoid confusion.


58-68: Remove key when last job unlocks to prevent map bloat and API noise.

Returning empty sets keeps stale entries and may surface UNLOCKED projects via API.

-    getMap().compute(projectId) { _, lockedJobIds ->
+    getMap().compute(projectId) { _, lockedJobIds ->
       logger.debug("Unlocking job: $jobId for project $projectId")
       val currentJobs = lockedJobIds ?: emptySet()
       if (currentJobs.contains(jobId)) {
         logger.debug("Unlocking job: $jobId for project $projectId")
-        val updatedJobs = currentJobs - jobId
-        return@compute updatedJobs.ifEmpty { emptySet() }
+        val updatedJobs = currentJobs - jobId
+        return@compute if (updatedJobs.isEmpty()) null else updatedJobs
       }
       logger.debug("Job: $jobId for project $projectId is not locked")
-      return@compute currentJobs
+      return@compute currentJobs
     }

Note: Returning null removes the entry (valid for ConcurrentMap.compute and Redisson RMap).
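
A quick self-contained illustration of that contract with a plain ConcurrentHashMap, shown only to make the "returning null removes the entry" semantics concrete (not code from this PR):

import java.util.concurrent.ConcurrentHashMap

fun main() {
  val locks = ConcurrentHashMap<Long, Set<Long>>()
  locks[1L] = setOf(42L)

  // Returning null from compute removes the mapping instead of storing an empty set.
  locks.compute(1L) { _, current ->
    val remaining = (current ?: emptySet()) - 42L
    if (remaining.isEmpty()) null else remaining
  }

  println(locks.containsKey(1L)) // false: the key is gone, not mapped to an empty set
}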


128-148: Honor projectConcurrency also when no initial job is found.

When initialJobId is null, the code locks toLock.id unconditionally. If misconfigured to 0, this bypasses the limit.

-      if (initialJobId == null) {
-        logger.debug("No initial job found, locking only ${toLock.id}")
-        return setOf(toLock.id)
-      }
+      if (initialJobId == null) {
+        logger.debug("No initial job found")
+        return if (batchProperties.projectConcurrency > 0) setOf(toLock.id) else emptySet()
+      }

Also consider validating projectConcurrency ≥ 1 in BatchProperties (see related comment).


189-191: Specify generics on Redisson map for type safety.

Minor, but explicit generics improve readability and static checks.

-  private fun getRedissonProjectLocks(): ConcurrentMap<Long, Set<Long>> {
-      return redissonClient.getMap(REDIS_PROJECT_BATCH_JOB_LOCKS_KEY)
-    }
+  private fun getRedissonProjectLocks(): ConcurrentMap<Long, Set<Long>> {
+    return redissonClient.getMap<Long, Set<Long>>(REDIS_PROJECT_BATCH_JOB_LOCKS_KEY)
+  }

117-158: Avoid DB calls inside distributed compute when possible.

getInitialJobId() runs under RMap.compute lock; can block other lockers. Cache the initial state or probe outside compute and pass it in, only falling back if entry still empty.

Would you like a patch that moves the initial-state probe outside compute with a CAS-style retry?
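
For illustration, a rough sketch of that approach under stated assumptions: probeInitialJobIds is a hypothetical stand-in for the getInitialJobId() database lookup, and limit corresponds to batchProperties.projectConcurrency. It shows the shape of the idea, not a drop-in patch:

import java.util.concurrent.ConcurrentMap

fun lockWithProbeOutsideCompute(
  map: ConcurrentMap<Long, Set<Long>>,
  projectId: Long,
  jobId: Long,
  limit: Int,
  probeInitialJobIds: (Long) -> Set<Long>,
): Set<Long> {
  // 1. Probe the database before entering compute, i.e. outside the per-key lock.
  val seed: Set<Long> =
    if (map[projectId].isNullOrEmpty()) probeInitialJobIds(projectId) else emptySet()

  // 2. Inside compute, fall back to the pre-fetched seed only if the entry is still empty,
  //    so no blocking I/O happens while the key is held.
  return map.compute(projectId) { _, current ->
    val jobs = if (current.isNullOrEmpty()) seed else current
    if (jobs.size < limit) jobs + jobId else jobs
  } ?: emptySet()
}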

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 824e968 and c6c2452.

📒 Files selected for processing (5)
  • backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/administration/ProjectBatchLockController.kt (4 hunks)
  • backend/app/src/test/kotlin/io/tolgee/batch/BatchJobTestUtil.kt (1 hunks)
  • backend/data/src/main/kotlin/io/tolgee/batch/BatchJobConcurrentLauncher.kt (1 hunks)
  • backend/data/src/main/kotlin/io/tolgee/batch/BatchJobProjectLockingManager.kt (5 hunks)
  • backend/data/src/main/kotlin/io/tolgee/configuration/tolgee/BatchProperties.kt (1 hunks)

Comment on lines +193 to 227
  override fun afterPropertiesSet() {
    // This runs first to check if redis has a map of the old format.
    // If so, we migrate it to the new format.
    if (!usingRedisProvider.areWeUsingRedis) {
      logger.debug("Not using Redis, skipping migration check")
      return
    }

    val redisProjectBatchJobLocks = redissonClient.getMap<Long, Any>(REDIS_PROJECT_BATCH_JOB_LOCKS_KEY)
    val isOldFormat = redisProjectBatchJobLocks.values.any { it is Long || it == null }
    if (!isOldFormat) {
      logger.debug("Redis project locks are in new format, no migration needed")
      return
    }

    logger.info("Starting migration of project locks from old format (v1) to new format (v2)")
    // First, copy all data from Redis to local memory
    val localCopy = mutableMapOf<Long, Set<Long>>()
    redisProjectBatchJobLocks.forEach { (projectId, jobId) ->
      val jobSet = when (jobId) {
        null, 0L -> emptySet<Long>()
        else -> setOf<Long>(jobId as Long)
      }
      localCopy[projectId] = jobSet
    }
    logger.info("Copied ${localCopy.size} project locks from old format to local memory")

    // Write all data back in new format (this will overwrite the old format)
    val newMap = getRedissonProjectLocks()
    localCopy.forEach { (projectId, jobSet) ->
      newMap[projectId] = jobSet
    }

    logger.info("Successfully migrated ${newMap.size} project locks from local memory to new format")
  }

⚠️ Potential issue | 🔴 Critical

Migration can throw ClassCastException on mixed formats; handle both Long and Set values.

Current code assumes all values are Long/null when isOldFormat is true; mixed maps will fail.

   override fun afterPropertiesSet() {
     // This runs first to check if redis has a map of the old format.
     // If so, we migrate it to the new format.
     if (!usingRedisProvider.areWeUsingRedis) {
       logger.debug("Not using Redis, skipping migration check")
       return
     }
 
-    val redisProjectBatchJobLocks = redissonClient.getMap<Long, Any>(REDIS_PROJECT_BATCH_JOB_LOCKS_KEY)
-    val isOldFormat = redisProjectBatchJobLocks.values.any { it is Long || it == null }
-    if (!isOldFormat) {
-      logger.debug("Redis project locks are in new format, no migration needed")
-      return
-    }
+    val redisProjectBatchJobLocks = redissonClient.getMap<Long, Any>(REDIS_PROJECT_BATCH_JOB_LOCKS_KEY)
+    val containsOnlyNew = redisProjectBatchJobLocks.values.all { it is Set<*> }
+    if (containsOnlyNew) {
+      logger.debug("Redis project locks are in new format, no migration needed")
+      return
+    }
 
     logger.info("Starting migration of project locks from old format (v1) to new format (v2)")
-    // First, copy all data from Redis to local memory
-    val localCopy = mutableMapOf<Long, Set<Long>>()
-    redisProjectBatchJobLocks.forEach { (projectId, jobId) ->
-      val jobSet = when (jobId) {
-        null, 0L -> emptySet<Long>()
-        else -> setOf<Long>(jobId as Long)
-      }
-      localCopy[projectId] = jobSet
-    }
-    logger.info("Copied ${localCopy.size} project locks from old format to local memory")
+    val localCopy = mutableMapOf<Long, Set<Long>>()
+    redisProjectBatchJobLocks.forEach { (projectId, value) ->
+      val jobSet: Set<Long> = when (value) {
+        null -> emptySet()
+        is Long -> if (value == 0L) emptySet() else setOf(value)
+        is Iterable<*> -> value.filterIsInstance<Long>().toSet()
+        else -> {
+          logger.warn("Unknown lock value type ${value::class.java} for project $projectId; treating as empty")
+          emptySet()
+        }
+      }
+      localCopy[projectId] = jobSet
+    }
+    logger.info("Prepared ${localCopy.size} project locks for new format")
 
     // Write all data back in new format (this will overwrite the old format)
     val newMap = getRedissonProjectLocks()
     localCopy.forEach { (projectId, jobSet) ->
       newMap[projectId] = jobSet
     }
 
     logger.info("Successfully migrated ${newMap.size} project locks from local memory to new format")
   }

@bdshadow bdshadow self-requested a review October 11, 2025 19:52
@bdshadow bdshadow left a comment

Also, did you check if the cancellation works properly?

Could you please check whether the scenarios tested in BatchJobManagementControllerTest and BatchJobManagementControllerCancellationTest are broken by allowing multiple jobs per project? Please also add separate tests for some of them with projectConcurrency=2.

    return redissonClient.getMap(REDIS_PROJECT_BATCH_JOB_LOCKS_KEY)
  }

  override fun afterPropertiesSet() {
Contributor

I believe the migration is not needed, because io.tolgee.component.CacheCleaner clears all caches on ApplicationReadyEvent.

@ajeans ajeans Oct 15, 2025

I didn't know about io.tolgee.component.CacheCleaner

Looking at it, it enumerates caches with io.tolgee.constants.Caches#caches
I don't see the project batch locks in there.

And theoretically, it would not make sense to synchronize n instances with a (key, value) in Redis (project_batch_job_locks), and then drop it any time any instance starts.

In our production and qa, we've not seen the locks disappear on reboots.
We had to add scripts to clear the locks quite often when we debugged the batch problems.

The migration of data in Redis is possibly the biggest complexity in this PR, but I didn't find a less intrusive way. My first takes on this were to transparently handle both Long? and Set<Long> data in Redis and map them to Set<Long> on the fly.
This proved very intrusive in a lot of methods (both reading and writing), and because of type erasure it forced me to duplicate many methods.

This is how I ended up with a Redis value migration on startup. It requires a reboot of all the instances, as the ones running version n-1 don't handle the new Redis value format.

Contributor

@ajeans I agree with your points.

@JanCizmar please take a look here too. You're more familiar with the code base, and you once pointed me to the CacheCleaner. Maybe there is something else we're missing.

Also @JanCizmar, please comment on the restart of all instances at once. ArgoCD does it one by one, but it's of course possible to make it stop all pods by manually setting replicas: 0.

ajeans commented Oct 15, 2025

@bdshadow I took care of the 3 comments and replied to the big comment regarding the Redis data migration.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
backend/data/src/main/kotlin/io/tolgee/batch/BatchJobProjectLockingManager.kt (1)

186-220: Handle mixed old/new format values to prevent ClassCastException.

The migration logic assumes all values are either old format (Long/null) or all new format (Set). However, line 195 checks if any value is old format, then line 207 casts all values to Long. If Redis contains a mix of formats, this will throw ClassCastException.

This issue was previously identified by coderabbitai[bot]. Apply the suggested fix to handle all value types (null, Long, Set, and unknown) gracefully during migration.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c6c2452 and a30479a.

📒 Files selected for processing (1)
  • backend/data/src/main/kotlin/io/tolgee/batch/BatchJobProjectLockingManager.kt (5 hunks)
🔇 Additional comments (2)
backend/data/src/main/kotlin/io/tolgee/batch/BatchJobProjectLockingManager.kt (2)

5-5: LGTM! Architecture changes support the new concurrent job requirements.

The addition of BatchProperties for configurable concurrency and InitializingBean for startup migration are appropriate for this feature.

Also applies to: 9-9, 15-15, 25-25, 29-29


58-67: LGTM! Set-based lock operations are correctly implemented.

The unlock logic properly removes individual job IDs from the set, and the lock methods correctly use set-based computations with the new format. The getLockedJobIds() flatten operation correctly collects all job IDs across projects.

Also applies to: 70-98, 100-114, 222-224

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a30479a and 213107c.

📒 Files selected for processing (2)
  • backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/batch/BatchJobManagementControllerCancellationTest.kt (4 hunks)
  • backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/batch/BatchJobManagementControllerTest.kt (9 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/batch/BatchJobManagementControllerTest.kt (2)
backend/testing/src/main/kotlin/io/tolgee/AbstractSpringTest.kt (1)
  • clearForcedDate (268-270)
backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/batch/AbstractBatchJobManagementControllerTest.kt (1)
  • saveAndPrepare (76-80)
backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/batch/BatchJobManagementControllerCancellationTest.kt (2)
backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/batch/AbstractBatchJobManagementControllerTest.kt (1)
  • saveAndPrepare (76-80)
backend/testing/src/main/kotlin/io/tolgee/AbstractSpringTest.kt (1)
  • executeInNewTransaction (256-262)

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/batch/BatchJobManagementControllerTest.kt (2)

98-110: Use AtomicBoolean (or CountDownLatch) for cross-thread wait; avoid visibility races

Local mutable var wait = true is not guaranteed visible across threads. Replace with AtomicBoolean (or a latch) to prevent intermittent hangs.

-    var wait = true
+    val wait = java.util.concurrent.atomic.AtomicBoolean(true)
...
-          while (wait) {
+          while (wait.get()) {
             Thread.sleep(100)
           }
...
-      wait = false
+      wait.set(false)

Add import if needed:

import java.util.concurrent.atomic.AtomicBoolean

Also applies to: 162-164


202-207: Same wait visibility issue in current jobs test

Make wait AtomicBoolean (or use a latch) to avoid data races.

-    var wait = true
+    val wait = java.util.concurrent.atomic.AtomicBoolean(true)
...
-      while (wait) {
+      while (wait.get()) {
         Thread.sleep(100)
       }
...
-      wait = false
+      wait.set(false)

Also applies to: 268-269

♻️ Duplicate comments (3)
backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/batch/BatchJobManagementControllerTest.kt (2)

35-40: Good fix: properly capture and restore projectConcurrency

Caching in @BeforeEach and restoring in @AfterEach prevents leakage across tests. Matches the earlier recommendation.

Also applies to: 46-46


231-234: Recheck lock count expectation here as well

Same potential flakiness: assert non‑zero and ≤ projectConcurrency; and that IDs belong to the started jobs.
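
For concreteness, the suggested assertions could look roughly like this, using plain AssertJ (the project's own assert extension works the same way); startedJobIds is a hypothetical collection of the IDs returned when the jobs were started:

import org.assertj.core.api.Assertions.assertThat

// Sketch only: lockedJobs would come from getLockedForProject, startedJobIds from the start responses.
fun verifyLockExpectations(lockedJobs: Set<Long>, startedJobIds: List<Long>, projectConcurrency: Int) {
  assertThat(lockedJobs).isNotEmpty()
  assertThat(lockedJobs.size).isLessThanOrEqualTo(projectConcurrency)
  assertThat(startedJobIds).containsAll(lockedJobs)
}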

backend/data/src/main/kotlin/io/tolgee/batch/BatchJobProjectLockingManager.kt (1)

200-226: Redis migration will ClassCast on mixed formats; handle Long/Set/null robustly

Maps can contain both old (Long/null) and new (Set) values. Current cast to Long will crash.

-    val redisProjectBatchJobLocks = redissonClient.getMap<Long, Any>(REDIS_PROJECT_BATCH_JOB_LOCKS_KEY)
-    val isOldFormat = redisProjectBatchJobLocks.values.any { it is Long || it == null }
-    if (!isOldFormat) {
-      logger.debug("Redis project locks are in new format, no migration needed")
-      return
-    }
+    val redisProjectBatchJobLocks = redissonClient.getMap<Long, Any>(REDIS_PROJECT_BATCH_JOB_LOCKS_KEY)
+    val containsOnlyNew = redisProjectBatchJobLocks.values.all { it is Set<*> }
+    if (containsOnlyNew) {
+      logger.debug("Redis project locks are in new format, no migration needed")
+      return
+    }
@@
-    val localCopy = mutableMapOf<Long, Set<Long>>()
-    redisProjectBatchJobLocks.forEach { (projectId, jobId) ->
-      val jobSet = when (jobId) {
-        null, 0L -> emptySet<Long>()
-        else -> setOf<Long>(jobId as Long)
-      }
-      localCopy[projectId] = jobSet
-    }
-    logger.info("Copied ${localCopy.size} project locks from old format to local memory")
+    val localCopy = mutableMapOf<Long, Set<Long>>()
+    redisProjectBatchJobLocks.forEach { (projectId, value) ->
+      val jobSet: Set<Long> = when (value) {
+        null -> emptySet()
+        is Long -> if (value == 0L) emptySet() else setOf(value)
+        is Iterable<*> -> value.filterIsInstance<Long>().toSet()
+        else -> {
+          logger.warn("Unknown lock value type ${value::class.java} for project $projectId; treating as empty")
+          emptySet()
+        }
+      }
+      localCopy[projectId] = jobSet
+    }
+    logger.info("Prepared ${localCopy.size} project locks for new format")
🧹 Nitpick comments (4)
backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/batch/BatchJobManagementControllerTest.kt (2)

121-124: Avoid println in tests

Use logger or remove to keep CI logs clean.

-          println(
-            "Job ${it.id} status ${it.status} progress: ${state?.values?.sumOf { it.successTargets.size }}",
-          )
+          // optionally log via logger if needed

125-132: Lock count assertion may be timing‑dependent

You assert lockedJobs.size == projectConcurrency while only 1 RUNNING job is expected. Finished jobs typically release locks, so the lock count may drop below N. Consider asserting:

  • lockedJobs.size > 0
  • lockedJobs.size <= projectConcurrency
  • lockedJobs subset of jobs’ IDs

Also applies to: 129-132

backend/data/src/main/kotlin/io/tolgee/batch/BatchJobProjectLockingManager.kt (2)

58-67: Remove mapping when last job unlocks to avoid empty entries

Return null from compute to delete the key when the set becomes empty. Prevents Redis clutter and keeps semantics clean.

-        val updatedJobs = currentJobs - jobId
-        return@compute updatedJobs.ifEmpty { emptySet() }
+        val updatedJobs = currentJobs - jobId
+        return@compute if (updatedJobs.isEmpty()) null else updatedJobs

228-230: getLockedJobIds() creates an aggregated set each call

Fine for diagnostics, but avoid hot paths. If used frequently, consider exposing counts per project or caching metrics instead.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 213107c and 05908af.

📒 Files selected for processing (2)
  • backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/batch/BatchJobManagementControllerTest.kt (9 hunks)
  • backend/data/src/main/kotlin/io/tolgee/batch/BatchJobProjectLockingManager.kt (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/batch/BatchJobManagementControllerTest.kt (2)
backend/testing/src/main/kotlin/io/tolgee/AbstractSpringTest.kt (1)
  • clearForcedDate (268-270)
backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/batch/AbstractBatchJobManagementControllerTest.kt (1)
  • saveAndPrepare (76-80)
🔇 Additional comments (1)
backend/data/src/main/kotlin/io/tolgee/batch/BatchJobProjectLockingManager.kt (1)

77-91: RMap.compute is atomic per key across clients. Redisson runs compute server-side via Lua scripting and per-key locking, so the projectId map compute is safe.

Comment on lines +127 to +147
       // nothing is locked
       if (lockedJobIds.isEmpty()) {
         logger.debug("Getting initial locked state from DB state")
         // we have to find out from database if there is any running job for the project
-        val initial = getInitialJobId(projectId)
-        logger.debug("Initial locked job $initial for project ${toLock.projectId}")
-        if (initial == null) {
-          logger.debug("No job found, locking ${toLock.id}")
-          return toLock.id
+        val initialJobId = getInitialJobId(projectId)
+        logger.info("Initial locked job $initialJobId for project ${toLock.projectId}")
+        if (initialJobId == null) {
+          logger.debug("No initial job found, locking only ${toLock.id}")
+          return setOf(toLock.id)
         }
+        val newLockedJobIds = mutableSetOf<Long>(initialJobId)
+        if (newLockedJobIds.size < batchProperties.projectConcurrency) {
+          logger.debug("Locking job ${toLock.id} for project $projectId. Active jobs before: $newLockedJobIds")
+          newLockedJobIds.add(toLock.id)
+        } else {
+          logger.debug(
+            "Cannot lock job ${toLock.id} for project $projectId, limit reached. Active jobs: $newLockedJobIds"
+          )
+        }
+        return newLockedJobIds
       }
@coderabbitai coderabbitai bot Oct 15, 2025

⚠️ Potential issue | 🔴 Critical

State recovery only seeds one existing job; can exceed concurrency when multiple jobs already RUNNING

If there are ≥2 RUNNING/started jobs in DB and the in-memory/redis set is empty, seeding a single job undercounts and can admit extra jobs above projectConcurrency.

Apply this change to seed up to projectConcurrency existing jobs:

-      val initialJobId = getInitialJobId(projectId)
-      logger.info("Initial locked job $initialJobId for project ${toLock.projectId}")
-      if (initialJobId == null) {
-        logger.debug("No initial job found, locking only ${toLock.id}")
-        return setOf(toLock.id)
-      }
-      val newLockedJobIds = mutableSetOf<Long>(initialJobId)
-      if (newLockedJobIds.size < batchProperties.projectConcurrency) {
+      val initialJobIds = getInitialJobIds(projectId, batchProperties.projectConcurrency)
+      logger.info("Initial locked jobs $initialJobIds for project ${toLock.projectId}")
+      if (initialJobIds.isEmpty()) {
+        logger.debug("No initial jobs found, locking only ${toLock.id}")
+        return setOf(toLock.id)
+      }
+      val newLockedJobIds = initialJobIds.toMutableSet()
+      if (newLockedJobIds.size < batchProperties.projectConcurrency) {
         logger.debug("Locking job ${toLock.id} for project $projectId. Active jobs before: $newLockedJobIds")
         newLockedJobIds.add(toLock.id)
       } else {
         logger.debug(
           "Cannot lock job ${toLock.id} for project $projectId, limit reached. Active jobs: $newLockedJobIds"
         )
       }
       return newLockedJobIds

Add helper (outside this block):

private fun getInitialJobIds(projectId: Long, limit: Int): Set<Long> {
  if (limit <= 0) return emptySet()
  val jobs = batchJobService.getAllIncompleteJobIds(projectId)

  // Prefer RUNNING
  val running = jobs.filter { it.status == io.tolgee.model.batch.BatchJobStatus.RUNNING }
    .map { it.jobId }
  if (running.isNotEmpty()) return running.take(limit).toSet()

  // Fallback: jobs that have started processing (some chunks locked)
  val unlockedCounts = batchJobService
    .getAllUnlockedChunksForJobs(jobs.map { it.jobId })
    .groupBy { it.batchJobId }
    .mapValues { it.value.count() }
  val started = jobs.filter { it.totalChunks != unlockedCounts[it.jobId] }
    .map { it.jobId }

  return started.take(limit).toSet()
}
🤖 Prompt for AI Agents
In backend/data/src/main/kotlin/io/tolgee/batch/BatchJobProjectLockingManager.kt
around lines 127 to 147, the recovery branch seeds only one existing job which
can undercount and allow more than projectConcurrency jobs to run; replace the
single-job lookup with a new helper getInitialJobIds(projectId,
batchProperties.projectConcurrency) (implement it outside this block) that
returns up to `limit` job IDs preferring RUNNING jobs, falling back to jobs that
have started processing (determined by comparing totalChunks vs unlocked chunk
counts), and then use that Set<Long> as the initial lockedJobIds (adding the
toLock.id only if it doesn’t exceed projectConcurrency). Ensure the helper calls
batchJobService.getAllIncompleteJobIds and
batchJobService.getAllUnlockedChunksForJobs as described and that limit<=0
returns empty set.

Contributor

@ajeans I think CodeRabbit is also right here. I think it's better to apply the changes, since it will really be correct when BatchProperties.concurrency is bigger than one.

Contributor

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Contributor Author

@bdshadow you mean I should turn getInitialJobId into getInitialJobIds?

I'd rather keep the PR minimal and as close to the version I have in production as possible.

Plus, this is very theoretical: just once, when nothing is in Redis, you might not use as many concurrent batches as the new configuration allows.
What is going to happen is that the remaining slots will be filled at most one minute later, by adding jobs to the locks through the standard path.

If we really wanted to make this more understandable, I'd totally remove getInitialJobId and this specific seeding, and always let normal operations take locks through the standard paths. Now that normal operations favor RUNNING jobs (from the previous merged PR), it would be much simpler and would work.

Again, I'd rather keep this PR as contained as possible, and as close to my production / battle-tested version as possible, if that works for you guys.

@coderabbitai is my understanding correct or am I missing something?

Contributor

@ajeans Your understanding is largely correct! Let me clarify the specific edge case:

The edge case scenario:

  1. Service restarts with Redis empty, projectConcurrency=2
  2. Jobs #1, #2, #3 are RUNNING in the DB
  3. Job #4 tries to lock → getInitialJobId finds Job #1 → lock set becomes [#1, #4]
  4. Job #1 finishes and unlocks → lock set becomes [#4] (Jobs #2, #3 are phantom/untracked)
  5. Job #5 tries to lock → lock set has size 1 < 2 → lock set becomes [#4, #5]
  6. Now 4 jobs are running (Jobs #2, #3, #4, #5) when the limit is 2

However, you're correct that:

  • This only happens once on restart when Redis is empty
  • It's temporary and bounded (you can't get more than projectConcurrency + (initially_running - 1) jobs)
  • As phantom jobs complete, the system self-corrects
  • It's been battle-tested in your production
  • For projectConcurrency=1 (probably common), the issue is minimal

Your alternative suggestion (removing seeding entirely) would also allow temporary over-concurrency in a different way, so it doesn't fundamentally solve the theoretical issue—it just changes when/how it occurs.

Bottom line: This is a reasonable engineering trade-off between theoretical correctness and practical simplicity. If you're comfortable with temporary over-concurrency on restart (which you've validated in production), keeping the PR minimal makes sense. The strict multi-seed approach would eliminate this edge case but adds complexity.

Your call—both approaches are defensible depending on whether you prioritize strict concurrency limits vs. battle-tested simplicity. 👍

  @ValueSource(ints = [1, 2])
  @ProjectJWTAuthTestMethod
  fun `cancels a job`() {
    batchDumper.finallyDump {
Contributor

Why is batchDumper removed here? It's really convenient for investigation in case of a test failure.

Contributor

Maybe batchDumper should just be extended with something like:

  fun <T> finallyDumpAll(fn: () -> T): T {
    return try {
      fn()
    } finally {
      getAllJobs().forEach { this.dump(it.id) }
    }
  }

  fun getAllJobs(): List<BatchJob> = entityManager.createQuery("""from BatchJob""", BatchJob::class.java).resultList

Contributor

While playing with it I actually noticed that setting batchProperties.projectConcurrency = 2 doesn't work here. It must be passed as application properties in the @SpringBootTest(

Contributor Author

Re batchDumper: I missed this, will have a look.

Re batchProperties.projectConcurrency = 2: how did you notice it doesn't work? Spring starts with its standard configuration, then the value is overridden before the test and restored to the original configuration after the test.
As the checks against the properties in BatchJobProjectLockingManager are always done dynamically, it should work.
This approach allows "@ParameterizedTest" to be used. With Spring test properties, it would require instantiating a different Spring Boot test context and separate test methods.
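
For readers following along, a sketch of the capture/override/restore pattern being described, assuming a Spring-managed test class where BatchProperties is injected; it mirrors the PR's test setup but is illustrative rather than the actual test code:

import io.tolgee.configuration.tolgee.BatchProperties
import org.junit.jupiter.api.AfterEach
import org.junit.jupiter.api.BeforeEach
import org.junit.jupiter.params.ParameterizedTest
import org.junit.jupiter.params.provider.ValueSource
import org.springframework.beans.factory.annotation.Autowired

class ProjectConcurrencyTestSketch {
  @Autowired
  lateinit var batchProperties: BatchProperties

  private var originalProjectConcurrency = 1

  @BeforeEach
  fun rememberConcurrency() {
    // Capture whatever value the Spring context started with.
    originalProjectConcurrency = batchProperties.projectConcurrency
  }

  @AfterEach
  fun restoreConcurrency() {
    // Restore it so the override never leaks into other tests.
    batchProperties.projectConcurrency = originalProjectConcurrency
  }

  @ParameterizedTest
  @ValueSource(ints = [1, 2])
  fun `runs with both concurrency settings`(projectConcurrency: Int) {
    // BatchJobProjectLockingManager reads the property at lock time,
    // so mutating the bean here is visible without a separate Spring context.
    batchProperties.projectConcurrency = projectConcurrency
    // ... start jobs and assert on the lock state ...
  }
}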

Contributor Author

I just pushed the requested batchDumper changes.

Unfortunately tests are all failing on my machine today (due to PG setup), but the change makes sense and it all compiles... Let's see if the CI agrees :D

@bdshadow Would you have any idea why the tests might be failing locally?
The build system seems to have launched the docker containers successfully

CONTAINER ID   IMAGE                           COMMAND                  CREATED       STATUS      PORTS                                           NAMES
ed02aeb8b7a6   postgres:16.3                   "docker-entrypoint.s…"   2 weeks ago   Up 3 days   0.0.0.0:55434->5432/tcp, [::]:55434->5432/tcp   tolgee_backend_tests_postgres_ee
bf2b3485c2f4   postgres:16.3                   "docker-entrypoint.s…"   2 weeks ago   Up 3 days   0.0.0.0:55433->5432/tcp, [::]:55433->5432/tcp   tolgee_backend_tests_postgres_main

@bdshadow

@ajeans one more comment for discussion here. Jan has already written these concerns about multiple jobs per project:

When implementing this I wanted to prevent some unexpected situations. For example, when user would start different jobs with the same keys and then 2 parallel jobs might modify data of the same keys. E.g., one job would translate the keys while another job would delete them. I also wanted to prevent one project to take over all the workers while other project would have to wait.

How do you think this should be handled? Did you try to run 2 jobs with the same keys? Or to do a batch delete while a batch translation is running? Actually, some tests for these cases would be good. They would show what the expected behaviour is in these cases.

ajeans commented Oct 17, 2025

@ajeans one more comment for discussion here. Jan has already written these concerns about multiple jobs per projects:

When implementing this I wanted to prevent some unexpected situations. For example, when user would start different jobs with the same keys and then 2 parallel jobs might modify data of the same keys. E.g., one job would translate the keys while another job would delete them. I also wanted to prevent one project to take over all the workers while other project would have to wait.

How do you think it must be handled? Did you try to run 2 jobs with the same keys? Or to do batch delete while batch translation is running? Actually, some tests for these cases would be good. It would show what is the expected behaviour in these cases.

@bdshadow unfortunately, that is beyond my area of expertise.

We've been running with a project concurrency of 10 and have seen no significant problems in production.
I admit we have only one LLM configured to asynchronously modify translations, not several providers at once.

My assumption was that you would keep the project concurrency at 1. We could even add a warning about this, saying anything >1 is experimental.

Regards

@bdshadow

@ajeans thank you for the clarification.
As far as I remember, the main problem you were trying to solve is that the batch translation of a huge number of keys took too much time. Did you try to increase BatchProperties.concurrency after the first fix was merged? I saw your email where you said that you experienced problems with it, but I believe that was before the first PR was merged.

ajeans commented Oct 18, 2025

@ajeans thank you for the clarification. As far as i remember, the main problem you were trying to solve is that the batch translation of a huge number of keys took you too much time. Did you try to increase the BatchProperties.concurrency after the first fix is merged? I saw your email, where you said that you experienced problems with it, but i believe it was before the first pr was merged.

@bdshadow

What we are currently running in production is

  • the branch in the original PR: fix: batch job processing #3262
  • 2 servers that serve HTTP traffic running with TOLGEE_BATCH_CONCURRENCY=10 and TOLGEE_BATCH_PROJECT_CONCURRENCY=1
  • 2 servers that don't serve HTTP traffic running with TOLGEE_BATCH_CONCURRENCY=10 and TOLGEE_BATCH_PROJECT_CONCURRENCY=5

In terms of code, the running code is very close to #3265 plus #3271. Main differences are tests (much better in the upstream version) and some code style improvements that didn't change the logic.

Now, to explain the history on our production systems:

As we have 1 project in production, obviously the per project lock is a major bottleneck.

We are well on our way to get the batch jobs backlog down to 0 in a few days with this version.

My current plan, once this PR is merged and the backlog is gone in production, is to move to the latest vanilla version, do the regression testing in QA, and do the battle testing in production.

Hope it makes sense. 🙏

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/batch/BatchJobManagementControllerCancellationTest.kt (1)

163-263: Consider testing concurrent operations on the same keys.

The test validates locking behavior when running two jobs concurrently with projectConcurrency=2, which is valuable. However, per the PR discussion, there are concerns about "concurrent jobs acting on the same keys (e.g., one translating while another deletes)."

Consider adding tests that verify data integrity when multiple concurrent jobs operate on overlapping key sets with different operations (translate vs. delete, etc.).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 05908af and 20ee69f.

📒 Files selected for processing (2)
  • backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/batch/BatchJobManagementControllerCancellationTest.kt (6 hunks)
  • backend/app/src/test/kotlin/io/tolgee/util/BatchDumper.kt (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/batch/BatchJobManagementControllerCancellationTest.kt (2)
backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/batch/AbstractBatchJobManagementControllerTest.kt (1)
  • saveAndPrepare (76-80)
backend/testing/src/main/kotlin/io/tolgee/AbstractSpringTest.kt (1)
  • executeInNewTransaction (256-262)
🔇 Additional comments (3)
backend/app/src/test/kotlin/io/tolgee/util/BatchDumper.kt (1)

81-93: LGTM! Addresses past review feedback.

The finallyDumpAll and getAllJobs methods implement the functionality requested in previous review comments, enabling bulk diagnostic dumps for tests with multiple batch jobs.

backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/batch/BatchJobManagementControllerCancellationTest.kt (2)

3-3: LGTM! Good test isolation.

Adding BatchJobProjectLockingManager and BatchProperties dependencies enables lock verification and concurrency control. The setup/teardown correctly preserves and restores the original projectConcurrency setting, ensuring proper test isolation.

Also applies to: 36-55


57-123: LGTM! Enhanced test coverage.

The parameterized test now validates both single and dual project concurrency scenarios, and the switch to finallyDumpAll addresses the past review feedback. The lock verification steps confirm that jobs are properly locked during execution and released upon cancellation.

Comment on lines +176 to +194
    doAnswer {
      val jobId = it.getArgument<io.tolgee.batch.data.BatchJobDto>(0).id
      val allJobs = executeInNewTransaction {
        batchJobService.getAllByProjectId(testData.project.id)
      }
      val isFirstJob = allJobs.size > 0 && jobId == allJobs[0].id

      val counter = if (isFirstJob) firstJobCount else secondJobCount

      if (counter.incrementAndGet() > 5) {
        while (simulateLongRunningChunkRun) {
          // this simulates long-running operation, which checks for active context
          val context = it.arguments[2] as CoroutineContext
          context.ensureActive()
          Thread.sleep(10)
        }
      }
      it.callRealMethod()
    }.whenever(machineTranslationChunkProcessor).process(any(), any(), any(), any())

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Ensure deterministic job ordering in the test.

The mock logic at line 181 assumes allJobs[0] is consistently the "first" job created, but getAllByProjectId may lack explicit ordering, leading to non-deterministic results. If the query does not guarantee order (e.g., by ID or creation timestamp), the test could be flaky.

Verify that getAllByProjectId orders jobs deterministically. If not, consider:

  1. Adding explicit ordering to the query, or
  2. Identifying jobs by storing the returned job IDs from the POST responses (lines 197-206 and 213-222) and using those IDs directly instead of relying on array index.

Example approach using stored IDs:

// After line 206
val firstJobId = performProjectAuthPost(...)
  .andIsOk.andReturn().response.contentAsString.let { 
    // parse job ID from response 
  }

Then reference firstJobId directly in the mock logic instead of checking allJobs[0].id.

Also applies to: 228-230


🏁 Script executed:

#!/bin/bash
# Find BatchJobService and its getAllByProjectId method
ast-grep --pattern 'fun getAllByProjectId($_, $_) {
  $$$
}'

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Search for getAllByProjectId usages and definitions in the codebase
rg -nP --type=kotlin 'getAllByProjectId' -B3 -A8

Length of output: 25551


🏁 Script executed:

#!/bin/bash
# Examine the test file to understand job ID storage and mock usage
cat backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/batch/BatchJobManagementControllerCancellationTest.kt | head -250

Length of output: 7613


🏁 Script executed:

#!/bin/bash
# Find BatchJobRepository and its findAllByProjectId method
rg -nP --type=kotlin 'class BatchJobRepository|fun findAllByProjectId' -B2 -A10 | head -100

Length of output: 10377


🏁 Script executed:

#!/bin/bash
# Search more specifically for the repository definition and query
find . -name "BatchJobRepository.kt" -type f -exec cat {} \;

Length of output: 3402


🏁 Script executed:

#!/bin/bash
# Check the POST response to see if job IDs are returned
grep -n "performProjectAuthPost" backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/batch/BatchJobManagementControllerCancellationTest.kt | head -20

Length of output: 96


🏁 Script executed:

#!/bin/bash
# Look at the batch job controller to see what the POST response contains
find . -path "*/controller/*BatchJob*Controller.kt" -type f | head -5

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Check what util.getSingleJob() does and if responses contain job IDs
grep -n "getSingleJob\|startBatchJob" backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/batch/*.kt | head -20

Length of output: 1243


🏁 Script executed:

#!/bin/bash
# Find the abstract test class
find . -name "AbstractBatchJobManagementControllerTest.kt" -type f -exec cat {} \;

Length of output: 2604


🏁 Script executed:

#!/bin/bash
# Search for the util field initialization and getSingleJob implementation
grep -n "class.*util\|val util\|fun getSingleJob" backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/batch/*.kt -A5

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Check what the POST response looks like (check controller or DTO)
rg -nP --type=kotlin 'BatchJobDto|BatchJobResponse' backend/app/src/main/kotlin --type=kotlin -C3 | head -50

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Find BatchJobTestUtil
find . -name "BatchJobTestUtil.kt" -type f -exec cat {} \;

Length of output: 15364


🏁 Script executed:

#!/bin/bash
# Search for BatchJobTestUtil and its methods
rg -nP --type=kotlin 'class BatchJobTestUtil|fun getSingleJob' -A10

Length of output: 3610


🏁 Script executed:

#!/bin/bash
# Check the batch job data models to see what fields are available
find . -path "*/batch/data/*" -name "*.kt" -type f | head -10

Length of output: 668


🏁 Script executed:

#!/bin/bash
# Find the batch job controller endpoint for POST machine-translate
rg -nP --type=kotlin '"start-batch-job"' -B5 -A15 | head -80

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Look for BatchJobDto to see if it contains ID
cat backend/data/src/main/kotlin/io/tolgee/batch/data/BatchJobDto.kt

Length of output: 1702


Deterministic job identification needed—getAllByProjectId lacks explicit ordering.

The mock logic assumes allJobs[0] is consistently the first job, but getAllByProjectId has no @Query annotation, so the generated query has no ORDER BY clause. This makes result ordering non-deterministic and could cause the test to fail or behave unpredictably.

Store and use job IDs from the POST responses (lines 197-206, 213-222) instead of relying on array indices at lines 181, 236-237. Since BatchJobDto exposes an id field, extract and store the returned IDs:

val firstJobIdResponse = performProjectAuthPost(
  "start-batch-job/machine-translate",
  mapOf(...)
).andIsOk.andReturn().response.contentAsString
// Parse firstJobId from JSON response

Then reference stored IDs directly in the mock and assertions instead of allJobs[0].id.
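
A possible shape for this, as a minimal sketch: it assumes the start-batch-job response JSON exposes the created job's id, that jacksonObjectMapper is available in the test class, and firstRequestBody/secondRequestBody stand in for the payloads already built at lines 197-206 and 213-222.

import com.fasterxml.jackson.module.kotlin.jacksonObjectMapper

// Sketch: read the job id straight from the POST response instead of indexing into allJobs.
private fun startJobAndGetId(requestBody: Map<String, Any?>): Long {
  val body = performProjectAuthPost("start-batch-job/machine-translate", requestBody)
    .andIsOk.andReturn().response.contentAsString
  return jacksonObjectMapper().readTree(body).get("id").asLong()
}

// In the test:
val firstJobId = startJobAndGetId(firstRequestBody)   // hypothetical payload for the first job
val secondJobId = startJobAndGetId(secondRequestBody) // hypothetical payload for the second job

The mock logic and the assertions at lines 181 and 236-237 can then compare against firstJobId and secondJobId, which stays correct regardless of the ordering returned by getAllByProjectId.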

@bdshadow
Contributor

@ajeans thank you very much for a detailed explanation.

I understand your motivation behind these changes. However, it will affect lots of other batch processes, which can lead to unpredictable results when they are launched simultaneously. I also understand that in your specific case it works, but I'm trying to find a solution that is more general.

Did you try just increasing TOLGEE_BATCH_CONCURRENCY? I mean, run with just the first fix, and for servers that don't serve HTTP traffic increase TOLGEE_BATCH_CONCURRENCY to, for example, 20 or even more. Maybe in this case BatchProperties.chunkQueuePopulationSize can be increased too.
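
For reference, that tuning would look roughly like this on a node that only runs batch work; the env-var name for the chunk queue size is an assumption based on Spring relaxed binding of BatchProperties, and the values are purely illustrative:

# Illustrative values only - tune to the workload
TOLGEE_BATCH_CONCURRENCY=20
# Assumed relaxed-binding form of BatchProperties.chunkQueuePopulationSize
TOLGEE_BATCH_CHUNK_QUEUE_POPULATION_SIZE=500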

@ajeans
Contributor Author

ajeans commented Oct 24, 2025

> @ajeans thank you very much for a detailed explanation.

Hi @bdshadow

> I understand your motivation behind these changes. However, it will affect lots of other batch processes, which can lead to unpredictable results when they are launched simultaneously. I also understand that in your specific case it works, but I'm trying to find a solution that is more general.

Feel free to find other ways to address the scalability issue for big projects with a lot of translations.
I'm eager to have Tolgee servers that can address my traffic in production, but I have no emotional attachment to this PR.

I only went this route and submitted this PR because @JanCizmar suggested it was a way to address this.
And if you make "1" the default, there is absolutely no downside to it.
I don't understand the reluctance, but that is obviously your call.

Regardless of this PR or any other solution, do you think Tolgee will support my performance requirements in the near future? I'd rather know now if this is not going to happen, please.
Production traffic will keep rising in the coming months, and I'd rather plan the next steps ahead of time.

> Did you try just increasing TOLGEE_BATCH_CONCURRENCY? I mean, run with just the first fix, and for servers that don't serve HTTP traffic increase TOLGEE_BATCH_CONCURRENCY to, for example, 20 or even more. Maybe in this case BatchProperties.chunkQueuePopulationSize can be increased too.

AFAIU, current main is identical to the version that managed 300 to 1,000 batches a day (see #3271 (comment)). There have been no changes to the batch code in the last 3 months, and my patches applied cleanly.

I don't believe that playing with chunk settings will be a 10x improvement to the workload, especially as we are mostly bound by the number of active threads calling OpenAI translations.

I'm on a vacation next week. If you have more questions I will answer them when I'm back.

@JanCizmar by then, can you weigh in on whether you see Tolgee handling our performance requirements?

@JanCizmar
Contributor

Hello @ajeans !

Thanks for raising this pull request and for the detailed context 😊

@bdshadow is part of the Tolgee team and he’s currently focused on improving batch operations performance across both self‑hosted and Tolgee Cloud instances. We absolutely agree that the current batch throughput is a bottleneck.

At the same time, we’re cautious about making the project lock limit configurable. Allowing multiple concurrent jobs per project could lead to race conditions or data inconsistencies when two jobs operate on the same keys. That’s why we originally kept the per‑project limit at one. If you need more throughput, increasing the global TOLGEE_BATCH_CONCURRENCY should have the same effect. If it doesn’t, it suggests another bottleneck that we need to fix, and we’ll look into that.

Our goal is to find a solution that keeps project locking safe while giving a significant performance boost. Dmitrii is working hard on this now, and we hope to have a robust fix soon. We should also improve our internal monitoring so we can see potential bottlenecks in Grafana.

Thanks again for tackling this and providing so much context; it really helps us prioritise and improve Tolgee 🚀

@ajeans
Contributor Author

ajeans commented Oct 25, 2025

@JanCizmar

Thank you for the detailed response. I'd like to provide more context on why increasing global TOLGEE_BATCH_CONCURRENCY alone doesn't solve the scalability issue for our production workload.
I also want to share the performance requirements that we have and that the upstream version of Tolgee would be expected to meet.

Production Timeline

[Image: tolgee-production-timeline]

This graph shows daily successful batch job completions from September through today:

  1. Baseline (Sept 1-22): ~10-100 jobs/day with various TOLGEE_BATCH_CONCURRENCY settings (1-20). The system was severely bottlenecked.

  2. Mass Cancellation (Sept 22): We manually cancelled ~110,000 stuck jobs to try to recover.

  3. PR #3265 (fix: batch job processing) applied (~Sept 29): The phantom lock fixes brought us to ~1,000 jobs/day - a significant improvement, but still insufficient.

  4. PR #3265 (fix: batch job processing) + PR #3271 (feat: implement project locks containing multiple job ids) applied (~Oct 6): Adding multi-job-per-project capability with projectConcurrency=5 jumped throughput to 15,000-21,000 jobs/day - a 15x improvement over #3265 alone.

Our Production Architecture

  • 4 Tolgee servers in a Redis-coordinated cluster

    • 2 servers handle HTTP traffic: TOLGEE_BATCH_CONCURRENCY=10, TOLGEE_BATCH_PROJECT_CONCURRENCY=1

    • 2 dedicated batch workers: TOLGEE_BATCH_CONCURRENCY=10, TOLGEE_BATCH_PROJECT_CONCURRENCY=5

  • 1 large project with >300,000 translation keys

  • Continuous batch translation workload (LLM-based)

Why Global Concurrency Alone Doesn't Work

The issue is the per-project lock bottleneck:

4 servers × 10 concurrent workers = 40 threads trying to process jobs

All jobs belong to 1 project

Only 1 job can hold the project lock at a time

Only chunks from that 1 job can be processed across all servers

Other threads spin waiting for chunks from the locked job

With the current single-job-per-project locking:

  • Only 1 job across all 4 servers can hold the project lock

  • Threads can process multiple chunks from that one job concurrently (good!)

  • BUT when chunks from the locked job aren't readily available in the local queue, threads must wait

  • Meanwhile, chunks from other jobs (which could be processed) are blocked by the project lock

  • The system becomes starved: threads are idle waiting for the "right" chunks while work sits in the queue

  • Result: ~1,000 jobs/day maximum throughput despite 40 available threads

With projectConcurrency=5:

  • 5 different jobs can hold the project lock simultaneously

  • This means chunks from 5 different jobs are eligible for processing at any time

  • Threads have a much higher probability of finding processable chunks in their queue

  • Better work distribution across the 40 available threads

  • Result: ~15,000 jobs/day throughput

The bottleneck isn't CPU or I/O - it's work starvation caused by the single-job lock limiting which chunks are eligible for processing.
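
To make the mechanism concrete, here is a simplified, in-memory-only sketch of the idea; the actual BatchJobProjectLockingManager in this PR also shares the lock map across instances via Redis, so names and structure here are abbreviated.

import java.util.concurrent.ConcurrentHashMap

// Simplified sketch: each project id maps to the set of job ids currently holding its lock.
class ProjectLockSketch(private val projectConcurrency: Int) {
  private val locks = ConcurrentHashMap<Long, Set<Long>>()

  /** Returns true if jobId holds (or just acquired) one of the project's lock slots. */
  fun tryLock(projectId: Long, jobId: Long): Boolean {
    var acquired = false
    locks.compute(projectId) { _, current ->
      val held = current ?: emptySet()
      when {
        jobId in held -> { acquired = true; held }                          // already holds a slot
        held.size < projectConcurrency -> { acquired = true; held + jobId } // free slot available
        else -> held                                                        // at capacity, reject
      }
    }
    return acquired
  }

  fun unlock(projectId: Long, jobId: Long) {
    // Drop the job id; remove the entry entirely once no job holds the lock.
    locks.computeIfPresent(projectId) { _, current -> (current - jobId).takeIf { it.isNotEmpty() } }
  }
}

With projectConcurrency = 1 this degenerates to today's single-job-per-project behavior; with 5, chunks from up to five jobs of the same project become eligible for processing at once.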

Addressing the Safety Concerns

I understand the concern about concurrent modifications to the same keys. However:

  1. Chunk-level isolation already exists: Each BatchJobChunkExecution is an atomic unit. Two chunks never operate on the same keys simultaneously - that's guaranteed by the chunking logic itself.

  2. Job-level conflicts are rare in practice: In our production workload, users don't typically launch conflicting job types (translate + delete) on the same keys simultaneously. When they do create a new batch job, it's usually after the previous one completes.

  3. Default value keeps existing behavior: With projectConcurrency defaulting to 1, existing deployments get zero behavior change. The safety model remains identical (see the sketch after this list).

  4. Opt-in for advanced users: Deployments like ours with a single large project and controlled batch workflows can opt into higher concurrency after understanding the trade-offs.
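
A rough sketch of what that default looks like as configuration; the class name, prefix, and comments below are illustrative rather than copied from the PR.

import org.springframework.boot.context.properties.ConfigurationProperties

// Sketch only: shows the shape of the proposed setting, not the actual Tolgee class.
@ConfigurationProperties(prefix = "tolgee.batch")
class BatchPropertiesSketch {
  /**
   * Maximum number of jobs allowed to hold a single project's lock at the same time.
   * 1 preserves the current single-job-per-project behavior; values > 1 are opt-in.
   */
  var projectConcurrency: Int = 1
}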

Path Forward

I respectfully suggest:

  • Keep projectConcurrency=1 as the default (no behavior change for existing users)

  • Add a configuration warning in the docs: "Values >1 are experimental and should only be used in controlled environments where batch job conflicts are managed externally"

This unblocks high-throughput single-project deployments like ours and gives the Tolgee team time to explore a more comprehensive solution for multi-project safety, while Rakuten France can safely operate during the shopping season.

The alternative is that deployments with large projects cannot scale beyond ~1,000 jobs/day on the current architecture, even with unlimited hardware resources (Tolgee instances or Postgres databases).

Thanks for your consideration!

@bdshadow
Contributor

@ajeans thank you very much for such a detailed report. It is really helpful.

What I don't like is the part "With the current single-job-per-project locking" about the processing of chunks by threads. Especially the following:

> BUT when chunks from the locked job aren't readily available in the local queue, threads must wait

And when I say I don't like it, I mean the way Tolgee does it.

I think I found the root of the problem with populating the in-memory queue. Please check the first pull request: #3287. It has a detailed explanation, which is also duplicated in the commit messages.

In short: locking the DB rows didn't work when selecting chunks for processing.

But as described in the commit messages, it still doesn't fully fix the problem. Let's imagine we have 2 pods running, each trying to populate the in-memory queue. The first one locks the rows in the chunks table and sends them to processing. The transaction finishes and the rows in the chunks table are unlocked. Now the second pod comes along before the first pod has actually started processing those chunks (i.e., updated their status from PENDING to RUNNING), selects exactly the same chunks, and sends them to its own queue. Now the two pods are conflicting over processing the same chunks.

I see that it can be fixed by introducing a new status for chunks: NEW. When BatchJobChunkExecutionQueue.populateQueue runs with DB row locking, it will also update the status from NEW to PENDING. This way, another pod won't be able to take them, and chunks will be distributed uniquely among all k8s pods.

I hope that with these changes there won't be a need to allow multiple batch jobs per project.
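
For what it's worth, that claim step could be expressed as a single statement; the table, column, and status names below are illustrative rather than the real schema, and a real implementation would also need the claimed ids (e.g. via RETURNING).

import org.springframework.jdbc.core.JdbcTemplate

// Sketch only: atomically claim up to `limit` NEW chunks by flipping them to PENDING,
// skipping rows that another pod is claiming at the same moment.
fun claimNewChunks(jdbc: JdbcTemplate, limit: Int): Int =
  jdbc.update(
    """
    UPDATE tolgee_batch_job_chunk_execution
       SET status = 'PENDING'
     WHERE id IN (
       SELECT id
         FROM tolgee_batch_job_chunk_execution
        WHERE status = 'NEW'
        ORDER BY id
        LIMIT ?
          FOR UPDATE SKIP LOCKED
     )
    """,
    limit,
  )

Run inside the same transaction as populateQueue, the NEW to PENDING transition becomes the point at which a chunk is owned by exactly one pod.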

@bdshadow bdshadow mentioned this pull request Oct 30, 2025
3 tasks
