Stateful Core ML generation does not support mixed-attention models that require multiple mask inputs #330
Summary
swift-transformers currently assumes a single-mask Core ML generation contract for stateful text generation.
This works for models that can be driven by one causal attention mask, but it breaks for stateful Core ML exports that require multiple mask inputs during generation, for example:
- one mask for full-attention layers
- one mask for sliding-window or local-attention layers
As a result, there is currently no stable way to use such stateful Core ML exports with the stock LanguageModel generation path.
Current runtime behavior
In Sources/Models/LanguageModel.swift, the stateful generation path only knows these input keys:
- inputIds
- attentionMask
- causalMask
- keyCache
- valueCache
Relevant locations:
- LanguageModel.Keys only defines attentionMask / causalMask
- predictNextTokenScores builds a single [1, 1, 1, tokenCount + 1] mask and passes that to the model
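For illustration, the single-mask contract reduces to something like this pure-Swift sketch (no Core ML dependency; the function name and the additive 0 = visible convention are assumptions for illustration, not taken from the library):

```swift
// Hypothetical sketch of the single-mask contract: for one decode step
// with a KV cache holding `tokenCount` tokens, the runtime builds one
// [1, 1, 1, tokenCount + 1] mask in which every position is visible.
// 0 = visible is an assumed additive-mask convention.
func makeSingleCausalMaskRow(tokenCount: Int) -> [Float] {
    // The new token may attend to all cached tokens plus itself.
    return [Float](repeating: 0, count: tokenCount + 1)
}
```

A model with only full-attention layers can be driven entirely by this one row; mixed-attention models cannot.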
Why this is a problem
Some decoder-only Core ML exports use mixed attention:
- some layers use full attention
- some layers use sliding-window or local attention
Those exports may need multiple attention masks during generation instead of a single causal mask.
If the model exposes multiple attention-mask inputs, the current swift-transformers generation runtime cannot drive it.
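To sketch why one mask cannot serve both layer types, compare the additive mask rows each layer type needs for a single decode step (pure Swift, hypothetical function names; 0 = visible and -Float.infinity = masked are assumed conventions):

```swift
// Full-attention layers: the new token sees every cached position.
func fullAttentionMaskRow(tokenCount: Int) -> [Float] {
    [Float](repeating: 0, count: tokenCount + 1)
}

// Sliding-window layers: only the most recent `windowSize` positions
// (including the current token) are visible; older ones are masked out.
func slidingWindowMaskRow(tokenCount: Int, windowSize: Int) -> [Float] {
    let length = tokenCount + 1
    return (0..<length).map { position in
        position >= length - windowSize ? 0 : -Float.infinity
    }
}
```

Once the cache grows past the window size, the two rows diverge, so an export with both layer types needs both masks as inputs.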
Important scope clarification
This issue is specifically about supporting explicit multi-mask Core ML model contracts in the runtime.
A separate experimental exporter path that tried to reconstruct multiple masks inside the Core ML graph from a single causalMask turned out to be unstable in practice, with failures such as:
```
perm tensor length must equal input tensor rank, 4 != 8
'mps.tile' op input rank: 8 should match multiplier length: 4
original module failed verification
```
That exporter-side issue is separate. The request here is to support the stable explicit-input contract in the runtime.
Repro context
- Export target: stateful Core ML (iOS 18, StateType KV cache)
- Runtime: latest swift-transformers main
- swift-transformers commit tested: e5e227b
Expected behavior
There should be a supported stateful Core ML generation path for decoder models that require multiple mask inputs.
Possible fix directions
- Extend the runtime to support additional attention-mask inputs when they are present in modelDescription.
- Add a more general mixed-attention Core ML generation path instead of assuming a single-mask contract.
- If this is currently out of scope, document that the stock stateful Core ML generation API only supports single-mask causal-attention models.
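The first direction could be as simple as keying mask construction off the model's declared inputs. A minimal pure-Swift sketch (the mask input names are hypothetical; in the real runtime they would come from the model's modelDescription rather than a hard-coded set):

```swift
// Hypothetical mask-input names a runtime might recognize; the library
// today only knows attentionMask and causalMask.
let knownMaskInputs: Set<String> = ["causalMask", "slidingWindowMask"]

// Given the input names a model actually declares, decide which mask
// inputs the runtime should populate on each decode step.
func maskInputsToFeed(declaredInputs: Set<String>) -> Set<String> {
    knownMaskInputs.intersection(declaredInputs)
}
```

A single-mask model would then keep its current behavior, while a multi-mask export would simply declare the extra inputs and have them filled.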
Why this matters
Without this support, any mixed-attention Core ML export requires a custom runtime instead of the stock swift-transformers generation API.
Additional context
Example converted Core ML repo using the explicit multi-mask contract:
https://huggingface.co/Skyline23/translategemma-4b-it-coreml