Turbopack: Fix near-duplicate chunks and non-deterministic ordering in production builds #90710
lukesandberg wants to merge 4 commits into canary from
Conversation
Add absorption pass to the production chunking merge algorithm. After the
main merge loop, small remaining candidates that couldn't find a merge
partner are now checked against existing heap chunks. If a small item's
bitmap overlaps sufficiently with a heap chunk's bitmap (duplication cost
exceeds extra download cost), the small item is absorbed into that chunk
rather than creating a near-duplicate.
This fixes cases where a tiny module (e.g. 360B) with bitmap {0,1,2}
would create a separate near-identical chunk instead of being absorbed
into a large chunk (e.g. 60KB) with bitmap {0,1}.
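The PR text doesn't show the exact cost model, so the check described above can only be sketched under assumptions: all names and the per-request overhead constant below are hypothetical, and the sketch assumes the small item's groups are a subset of the target chunk's groups, so no chunk group loses access to the module.

```rust
// Hypothetical sketch of an absorption cost check. Bitmaps are u64 bitsets:
// bit i set means chunk group i needs the chunk. Sizes are in bytes.
#[derive(Debug)]
struct Chunk {
    bitmap: u64,
    size: u64,
}

// Absorbing `small` into `big` overships `small.size` bytes to every group
// that needs `big` but not `small`; keeping `small` as its own tiny chunk
// costs one extra request per group that needs it.
fn should_absorb(small: &Chunk, big: &Chunk, request_overhead: u64) -> bool {
    let overshipped_groups = (big.bitmap & !small.bitmap).count_ones() as u64;
    let extra_download = overshipped_groups * small.size;

    let extra_requests = small.bitmap.count_ones() as u64;
    let duplication_cost = extra_requests * request_overhead;

    duplication_cost > extra_download
}

fn main() {
    // A 360 B module needed by groups {0,1}, next to a 60 KB chunk for {0,1,2}.
    let small = Chunk { bitmap: 0b011, size: 360 };
    let big = Chunk { bitmap: 0b111, size: 60_000 };
    // Overshipping 360 B to one group beats paying two extra requests here
    // (with a modeled 1 KB per-request overhead).
    println!("absorb: {}", should_absorb(&small, &big, 1_000));
}
```

This is one plausible reading of "duplication cost exceeds extra download cost", not the actual Turbopack heuristic.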
Also adds merge.rs with a pure Vc-free extraction of the merge algorithm
and 7 unit tests covering the near-duplicate scenarios.
Refactor production.rs to call merge_grouped_chunks() from merge.rs instead of having its own copy of the merge algorithm. This eliminates ~500 lines of duplicated merge logic. production.rs now:

1. Resolves Vc values and groups items by bitmap (unchanged)
2. Converts groups to GroupInput (size + bitmap + batch_group_id)
3. Calls merge_grouped_chunks() for the merge + absorption pass
4. Maps results back to make_chunk() calls

merge.rs now exposes two entry points:

- merge_grouped_chunks(): takes pre-grouped items (used by production.rs)
- merge_chunks(): groups items by bitmap first, then calls merge_grouped_chunks() (used by unit tests)

Test-only types (ChunkItemForMerging, MergedChunkInfo, merge_chunks, hash_bitmap) are gated with #[cfg(test)].
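The split between the two entry points can be sketched roughly like this. The types and signatures are assumptions based on the description above, not the actual code, and the core algorithm is stubbed out:

```rust
use std::collections::HashMap;

/// Assumed shape of a pre-grouped input: total size plus the chunk-group
/// bitmap (batch_group_id omitted for brevity).
#[derive(Debug)]
struct GroupInput {
    size: u64,
    bitmap: u64,
}

/// Sketch of the test-only wrapper: group raw (size, bitmap) items by
/// bitmap, then hand the grouped form to the core algorithm.
fn merge_chunks(items: Vec<(u64, u64)>) -> Vec<GroupInput> {
    let mut by_bitmap: HashMap<u64, u64> = HashMap::new();
    for (size, bitmap) in items {
        *by_bitmap.entry(bitmap).or_insert(0) += size;
    }
    let mut groups: Vec<GroupInput> = by_bitmap
        .into_iter()
        .map(|(bitmap, size)| GroupInput { size, bitmap })
        .collect();
    // Sort for a deterministic order before handing off (HashMap iteration
    // order is unspecified).
    groups.sort_by_key(|g| g.bitmap);
    merge_grouped_chunks(groups)
}

/// Stand-in for the core algorithm; the real one runs the merge loop and
/// the absorption pass over the grouped inputs.
fn merge_grouped_chunks(groups: Vec<GroupInput>) -> Vec<GroupInput> {
    groups
}

fn main() {
    let out = merge_chunks(vec![(100, 0b01), (200, 0b01), (50, 0b11)]);
    println!("{out:?}");
}
```

Two items sharing bitmap 0b01 collapse into one 300-byte group, mirroring how production.rs can skip the grouping step and call merge_grouped_chunks() directly.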
…shes

Different routes perform independent DFS traversals from different entry points, producing different topological orderings of shared modules. Since production chunk filenames are derived from content hashes of the serialized JS bytes, this ordering instability causes the same set of modules to produce chunks with different filenames across routes, leading to duplicate downloads.

Fix by sorting chunk items by ModuleId before serialization in both browser and Node.js chunk content emitters. This ensures identical module sets always produce identical content hashes regardless of traversal order.
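The effect of the fix can be sketched minimally. The names here (serialize_chunk, ChunkItem) are hypothetical, and a joined string stands in for the real serialized JS bytes:

```rust
#[derive(Clone)]
struct ChunkItem {
    module_id: String,
    code: String,
}

/// Serialize a chunk's items into one string, sorting by module id first so
/// the same set of modules always produces the same bytes (and thus the
/// same content hash), regardless of the DFS order that collected them.
fn serialize_chunk(mut items: Vec<ChunkItem>) -> String {
    items.sort_by(|a, b| a.module_id.cmp(&b.module_id));
    items
        .iter()
        .map(|i| format!("{}:{}", i.module_id, i.code))
        .collect::<Vec<_>>()
        .join(",")
}

fn main() {
    let a = ChunkItem { module_id: "app/layout".into(), code: "l()".into() };
    let b = ChunkItem { module_id: "app/page".into(), code: "p()".into() };
    // Two routes traversed the same shared modules in different orders:
    let route1 = serialize_chunk(vec![a.clone(), b.clone()]);
    let route2 = serialize_chunk(vec![b, a]);
    // Identical bytes => identical content hash => one filename, one download.
    assert_eq!(route1, route2);
    println!("{route1}");
}
```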
…ks integration

After integrating merge_chunks into production.rs, chunk items were being collected in the order that groups were merged, which is non-deterministic due to hash map iteration order and heap operations in the merge algorithm. The merge algorithm extends group_indices and batch_group_ids in whatever order groups are encountered during merging. When production.rs iterates over these unsorted indices to collect chunk items, the same set of modules produces different output orders across builds.

Fix by sorting group_indices and batch_group_ids before iterating over them to collect chunk items. This ensures deterministic output regardless of the merge algorithm's internal ordering.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Merging this PR will not alter performance
Failing test suites (Commit: 5aa9876)

- Error recovery app › can recover from a event handler error
- Error recovery app › render error not shown right after syntax error
- ReactRefreshLogBox › module init error not shown
- ReactRefreshLogBox app › logbox: anchors links in error messages
- ReactRefreshLogBox app › Should not show webpack_exports when exporting anonymous arrow function
- ReactRefreshLogBox app › Unhandled errors and rejections opens up in the minimized state
- deterministic build - changing deployment id › build output API - builder › should produce identical build outputs even when changing deployment id
Stats from current PR: ✅ No significant changes detected

Turbopack Client Main Bundles: 401 kB → 401 kB ✅ -14 B (80 files with content-based hashes; individual files not comparable between builds)
    write!(code, ",")?;
}

// Sort chunk items by module ID to ensure deterministic output regardless of
That's not good for compression (I accidentally did it with webpack). Instead sort by identifier, which is path order, which is much better.

But sorting shouldn't happen during code generation; it should happen when the chunk is created. The list of chunk items can be sorted there.
// Absorption pass: try to absorb small remaining candidates into existing heap chunks.
// This prevents tiny modules with slightly different bitmaps from creating near-duplicate
// chunks. A small module (e.g. 360B) with bitmap {0,1,2} should be absorbed into a large
// chunk (e.g. 60KB) with bitmap {0,1} rather than creating a separate near-identical chunk.
The explanation doesn't make sense. When doing so, chunk group 2 would ship an extra 60kb.
The reverse case might make sense, but it would also overship modules, which we previously avoided. (We never send modules you don't need.)

I think what would make sense is to duplicate the small module, put it in the {0,1} chunk, and leave a chunk with {2} and the small module.
That would reduce the requests for groups 0 and 1.
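The suggested alternative can be sketched as follows (all names are hypothetical): the small module is copied into the overlapping chunk, and a remainder chunk keeps serving the groups that chunk doesn't cover, so no group ovеrships and no group loses access.

```rust
// Sketch of the reviewer's suggestion: duplicate the small module into the
// big chunk and emit a remainder chunk for the uncovered groups.
#[derive(Debug, PartialEq)]
struct Chunk {
    bitmap: u64, // bit i set => chunk group i needs this chunk
    modules: Vec<&'static str>,
}

/// Copy `small_module` (needed by `small_bitmap`) into `big`, which covers
/// the shared groups, and return a remainder chunk for the groups `big`
/// doesn't reach (or None if `big` covers them all).
fn duplicate_into(
    big: &mut Chunk,
    small_module: &'static str,
    small_bitmap: u64,
) -> Option<Chunk> {
    big.modules.push(small_module); // shared groups now need one fewer request
    let remainder = small_bitmap & !big.bitmap;
    (remainder != 0).then(|| Chunk { bitmap: remainder, modules: vec![small_module] })
}

fn main() {
    // The PR's example: a 60 KB chunk for groups {0,1}, a tiny module for {0,1,2}.
    let mut big = Chunk { bitmap: 0b011, modules: vec!["shared-60kb"] };
    let rest = duplicate_into(&mut big, "tiny-360b", 0b111);
    // Groups {0,1} fetch one chunk with both modules; group {2} fetches only
    // the tiny remainder chunk.
    println!("{:?} {:?}", big, rest);
}
```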
I think extracting the logic and adding test cases is valuable. Would be cool if that were a separate PR.

yeah, good idea on that; v-work got a little aggressive with this one.
What?
Four changes to the Turbopack production chunking algorithm:
1. Absorption pass for near-duplicate chunks: After the main merge loop, small remaining candidates that couldn't find a merge partner are now checked against existing heap chunks. If a small item's bitmap overlaps sufficiently (duplication cost > extra download cost), the small item is absorbed rather than creating a near-duplicate chunk.

2. Extract merge algorithm into merge.rs: The merge/absorption logic is extracted from production.rs into a standalone merge.rs module with pure (Vc-free) types, enabling direct unit testing. production.rs now calls merge_grouped_chunks() instead of duplicating the algorithm.

3. Sort chunk items by ModuleId in content emitters: Different routes perform independent DFS traversals, producing different module orderings. Sorting by ModuleId before serialization in both browser and Node.js content emitters ensures identical module sets always produce identical content hashes.

4. Sort group indices after merge: The merge algorithm's group_indices and batch_group_ids are collected in non-deterministic order (hash maps + heap). Sorting them in production.rs before collecting chunk items ensures deterministic output.

Why?
Users reported that Turbopack creates duplicate (or nearly duplicate) chunks for shared layouts, leading to excess download bytes. Two root causes:
- A small module (e.g. 360B) with bitmap {0,1,2} would create a separate chunk instead of being absorbed into a large chunk (e.g. 60KB) with bitmap {0,1}, resulting in two nearly identical chunks being downloaded.
- Non-deterministic module ordering caused identical module sets to produce chunks with different filenames across routes, leading to duplicate downloads.

How?
merge.rs (new file):
- merge_grouped_chunks(): Core merge algorithm operating on GroupInput (size + bitmap + batch_group_id). Used by production.rs.
- merge_chunks(): Test-only wrapper that groups items by bitmap first, then delegates to merge_grouped_chunks().

production.rs (refactored):
- Converts groups to GroupInput, calls merge_grouped_chunks().
- Sorts group_indices and batch_group_ids before iterating to ensure deterministic chunk item collection.

content.rs (browser + nodejs):
- Sorts chunk items by ModuleId, then serializes. Ensures identical content hashes regardless of DFS traversal order.

Testing
- 7 unit tests in merge.rs covering the merge algorithm directly (near-duplicate absorption, layout sharing, deep nesting, input-order stability, characterization, no-merge config)