Stateful Core ML generation does not support mixed-attention models that require multiple mask inputs #330

@Skyline-23

Summary

swift-transformers currently assumes a single-mask Core ML generation contract for stateful text generation.

This works for models that can be driven by one causal attention mask, but it breaks for stateful Core ML exports that require multiple mask inputs during generation, for example:

  • one mask for full-attention layers
  • one mask for sliding-window or local-attention layers

As a result, there is currently no stable way to use such stateful Core ML exports with the stock LanguageModel generation path.

Current runtime behavior

In Sources/Models/LanguageModel.swift, the stateful generation path only knows these input keys:

  • inputIds
  • attentionMask
  • causalMask
  • keyCache
  • valueCache

Relevant locations:

  • LanguageModel.Keys only defines attentionMask / causalMask
  • predictNextTokenScores builds a single [1, 1, 1, tokenCount + 1] mask and passes that to the model
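To illustrate the single-mask contract, here is a minimal sketch of how a causal mask of that shape can be built with Core ML APIs. This is not the actual swift-transformers source; the function name and fill strategy are illustrative, assuming a float16 mask where 0 means "attend".

```swift
import CoreML

// Hypothetical sketch of the current single-mask contract: one causal
// mask of shape [1, 1, 1, tokenCount + 1] is built per decoding step
// and passed to the model as the only mask input.
func makeCausalMask(tokenCount: Int) throws -> MLMultiArray {
    let mask = try MLMultiArray(
        shape: [1, 1, 1, NSNumber(value: tokenCount + 1)],
        dataType: .float16
    )
    // During incremental decoding the current token may attend to every
    // cached position, so the single mask row is all zeros (0 = attend).
    for i in 0..<mask.count {
        mask[i] = 0
    }
    return mask
}
```

A model whose contract needs two such masks (with different shapes or banding) has no place to receive the second one in this path.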

Why this is a problem

Some decoder-only Core ML exports use mixed attention:

  • some layers use full attention
  • some layers use sliding-window or local attention

Those exports may need multiple attention masks during generation instead of a single causal mask.

If the model exposes multiple attention-mask inputs, the current swift-transformers generation runtime cannot drive it.
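For contrast, a sketch of what driving an explicit multi-mask contract could look like with the stateful prediction API. The input names `global_causal_mask` and `local_causal_mask` are assumptions for illustration; in practice they would come from the exported model's description.

```swift
import CoreML

// Hypothetical sketch: one prediction step for a mixed-attention export
// that exposes two mask inputs. The KV cache lives in `state` (the
// iOS 18 stateful Core ML API).
func predictStep(
    model: MLModel,
    state: MLState,
    inputIds: MLMultiArray,
    globalMask: MLMultiArray,   // mask for full-attention layers
    localMask: MLMultiArray     // mask for sliding-window layers
) throws -> MLFeatureProvider {
    let inputs = try MLDictionaryFeatureProvider(dictionary: [
        "inputIds": MLFeatureValue(multiArray: inputIds),
        "global_causal_mask": MLFeatureValue(multiArray: globalMask),
        "local_causal_mask": MLFeatureValue(multiArray: localMask),
    ])
    return try model.prediction(from: inputs, using: state)
}
```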

Important scope clarification

This issue is specifically about supporting explicit multi-mask Core ML model contracts in the runtime.

A separate experimental exporter path that tried to reconstruct multiple masks inside the Core ML graph from a single causalMask turned out to be unstable in practice, with failures such as:

perm tensor length must equal input tensor rank, 4 != 8
'mps.tile' op input rank: 8 should match multiplier length: 4
original module failed verification

That exporter-side issue is separate. The request here is to support the stable explicit-input contract in the runtime.

Repro context

  • Export target: stateful Core ML (iOS 18, StateType KV cache)
  • Runtime: latest swift-transformers main
  • swift-transformers commit tested: e5e227b

Expected behavior

There should be a supported stateful Core ML generation path for decoder models that require multiple mask inputs.

Possible fix directions

  1. Extend the runtime to support additional attention-mask inputs when they are present in modelDescription.

  2. Add a more general mixed-attention Core ML generation path instead of assuming a single-mask contract.

  3. If this is currently out of scope, document that the stock stateful Core ML generation API only supports single-mask causal-attention models.
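Fix direction 1 could start from something like the sketch below: discover mask-like inputs from the compiled model's description instead of hardcoding a single key. The name-based heuristic is an assumption for illustration, not a proposed final design.

```swift
import CoreML

// Sketch: enumerate the model's declared inputs and pick out the ones
// that look like attention masks, so the generation loop can build one
// mask per declared input rather than assuming a single causalMask.
func maskInputNames(of model: MLModel) -> [String] {
    model.modelDescription.inputDescriptionsByName
        .keys
        .filter { $0.lowercased().contains("mask") }
        .sorted()
}
```

The generation loop could then construct a full causal mask for global-attention inputs and a banded mask for sliding-window inputs, using each input's declared shape as the source of truth.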

Why this matters

Without this support, any mixed-attention Core ML export requires a custom runtime instead of the stock swift-transformers generation API.

Additional context

Example converted Core ML repo using the explicit multi-mask contract:
https://huggingface.co/Skyline23/translategemma-4b-it-coreml
