Stateful Core ML generation does not support mixed-attention models that require multiple mask inputs #330

@Skyline-23

Summary

swift-transformers currently assumes a single-mask Core ML generation contract for stateful text generation.

This works for models that can be driven by one causal attention mask, but it breaks for stateful Core ML exports that require multiple mask inputs during generation, for example:

  • one mask for full-attention layers
  • one mask for sliding-window or local-attention layers

As a result, there is currently no stable way to use such stateful Core ML exports with the stock LanguageModel generation path.

Current runtime behavior

In Sources/Models/LanguageModel.swift, the stateful generation path only knows these input keys:

  • inputIds
  • attentionMask
  • causalMask
  • keyCache
  • valueCache

Relevant locations:

  • LanguageModel.Keys only defines attentionMask / causalMask
  • predictNextTokenScores builds a single [1, 1, 1, tokenCount + 1] mask and passes that to the model
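To illustrate the single-mask contract, here is a minimal sketch of how a causal mask of that shape can be built with Core ML APIs. This is not the actual swift-transformers source; the function name and fill strategy are illustrative, assuming a float16 mask where 0 means "attend".

```swift
import CoreML

// Hypothetical sketch of the current single-mask contract: one causal
// mask of shape [1, 1, 1, tokenCount + 1] is built per decoding step
// and passed to the model as the only mask input.
func makeCausalMask(tokenCount: Int) throws -> MLMultiArray {
    let mask = try MLMultiArray(
        shape: [1, 1, 1, NSNumber(value: tokenCount + 1)],
        dataType: .float16
    )
    // During incremental decoding the current token may attend to every
    // cached position, so the single mask row is all zeros (0 = attend).
    for i in 0..<mask.count {
        mask[i] = 0
    }
    return mask
}
```

A model whose contract needs two such masks (with different shapes or banding) has no place to receive the second one in this path.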

Why this is a problem

Some decoder-only Core ML exports use mixed attention:

  • some layers use full attention
  • some layers use sliding-window or local attention

Those exports may need multiple attention masks during generation instead of a single causal mask.

If the model exposes multiple attention-mask inputs, the current swift-transformers generation runtime cannot drive it.
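For contrast, a sketch of what driving an explicit multi-mask contract could look like with the stateful prediction API. The input names `global_causal_mask` and `local_causal_mask` are assumptions for illustration; in practice they would come from the exported model's description.

```swift
import CoreML

// Hypothetical sketch: one prediction step for a mixed-attention export
// that exposes two mask inputs. The KV cache lives in `state` (the
// iOS 18 stateful Core ML API).
func predictStep(
    model: MLModel,
    state: MLState,
    inputIds: MLMultiArray,
    globalMask: MLMultiArray,   // mask for full-attention layers
    localMask: MLMultiArray     // mask for sliding-window layers
) throws -> MLFeatureProvider {
    let inputs = try MLDictionaryFeatureProvider(dictionary: [
        "inputIds": MLFeatureValue(multiArray: inputIds),
        "global_causal_mask": MLFeatureValue(multiArray: globalMask),
        "local_causal_mask": MLFeatureValue(multiArray: localMask),
    ])
    return try model.prediction(from: inputs, using: state)
}
```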

Important scope clarification

This issue is specifically about supporting explicit multi-mask Core ML model contracts in the runtime.

A separate experimental exporter path that tried to reconstruct multiple masks inside the Core ML graph from a single causalMask turned out to be unstable in practice, with failures such as:

perm tensor length must equal input tensor rank, 4 != 8
'mps.tile' op input rank: 8 should match multiplier length: 4
original module failed verification

That exporter-side issue is separate. The request here is to support the stable explicit-input contract in the runtime.

Repro context

  • Export target: stateful Core ML (iOS 18, StateType KV cache)
  • Runtime: latest swift-transformers main
  • swift-transformers commit tested: e5e227b

Expected behavior

There should be a supported stateful Core ML generation path for decoder models that require multiple mask inputs.

Possible fix directions

  1. Extend the runtime to support additional attention-mask inputs when they are present in modelDescription.

  2. Add a more general mixed-attention Core ML generation path instead of assuming a single-mask contract.

  3. If this is currently out of scope, document that the stock stateful Core ML generation API only supports single-mask causal-attention models.
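Fix direction 1 could start from something like the sketch below: discover mask-like inputs from the compiled model's description instead of hardcoding a single key. The name-based heuristic is an assumption for illustration, not a proposed final design.

```swift
import CoreML

// Sketch: enumerate the model's declared inputs and pick out the ones
// that look like attention masks, so the generation loop can build one
// mask per declared input rather than assuming a single causalMask.
func maskInputNames(of model: MLModel) -> [String] {
    model.modelDescription.inputDescriptionsByName
        .keys
        .filter { $0.lowercased().contains("mask") }
        .sorted()
}
```

The generation loop could then construct a full causal mask for global-attention inputs and a banded mask for sliding-window inputs, using each input's declared shape as the source of truth.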

Why this matters

Without this support, any mixed-attention Core ML export requires a custom runtime instead of the stock swift-transformers generation API.

Additional context

Example converted Core ML repo using the explicit multi-mask contract:
https://huggingface.co/Skyline23/translategemma-4b-it-coreml
