feat(apple foundation models) #9473

Open
zombat wants to merge 23 commits into stanfordnlp:main from zombat:feature/apple-foundation-models
Conversation


@zombat zombat commented Mar 19, 2026

feat: Add AppleFoundationLM and AppleLocalLM — native Apple Silicon & Apple Intelligence backends

Summary

This PR adds two new BaseLM subclasses that allow DSPy programs to run entirely on-device
on Apple Silicon, with no cloud dependency:

  • dspy.AppleFoundationLM — wraps Apple's LanguageModelSession (macOS 26+, Apple
    Intelligence) with native constrained decoding for structured outputs and full fm.Tool
    support for tool calling.
  • dspy.AppleLocalLM — wraps mlx-lm to run any HuggingFace model natively on Apple
    Silicon. Supports the full DSPy optimizer workflow including BootstrapFewShot and
    MIPROv2.

Both adapters are zero-regression on Linux CI: all platform-gated imports are lazily
loaded and wrapped in try/except guards. The entire test suite (69 unit tests) passes on
Linux/WSL with no macOS dependencies.

Reviewer Note: Feature-Pass Evidence vs Pre-Existing Failures

Feature-pass evidence (this PR path)

  • pre-commit run --all-files passes on this branch.
  • Branch-scoped Ruff checks pass for files changed in this PR.
  • Apple-path tests pass locally:
    • tests/clients/test_apple_fm.py
    • tests/clients/test_apple_local.py
    • tests/integration/test_apple_fm_integration.py
    • Result: 80 passed, 0 failed.

Pre-existing failures (not introduced by this PR)

  • Full-suite local run (pytest -vv tests/) reports failures outside Apple-path files.
  • Dominant failure pattern is baseline adapter/client behavior (for example,
    AttributeError: 'ModelResponse' object has no attribute 'usage') in non-Apple tests.
  • These failures are treated as repository baseline issues and are out of scope for this Apple adapter PR.

Apple-Only Reviewer Checklist

  • Confirm import safety on non-macOS: import dspy succeeds when Apple SDKs are absent.
  • Verify dspy.AppleFoundationLM init guards: clear errors for non-macOS / missing SDK / unavailable model.
  • Verify dspy.AppleLocalLM init guards: clear errors for non-macOS, Intel Mac, missing mlx-lm.
  • Validate structured output path for AppleFoundationLM:
    • Pydantic response_format maps to native constrained decoding when supported.
    • Fallback path logs warning and still returns valid DSPy-parseable output.
  • Validate tool-calling behavior:
    • AppleFoundationLM executes DSPy tools via fm.Tool wrapping.
    • AppleLocalLM rejects tools=[...] with explicit NotImplementedError.
  • Validate streaming behavior:
    • AppleLocalLM supports dspy.streamify() token streaming.
    • stream=True direct call behavior remains explicit and documented.
  • Validate cache behavior on both adapters:
    • Cache hit short-circuits model call.
    • Cache key excludes DSPy-internal noise fields.
  • Validate concurrency gate in AppleLocalLM (max_concurrency) prevents uncontrolled parallel MLX calls.
  • Run example smoke checks from examples/apple_on_device_lm.py for at least one Foundation and one Local path.

Motivation

DSPy programs that optimize prompts via BootstrapFewShot or MIPROv2 typically make
hundreds or thousands of LM calls. Running those optimization loops against cloud LLMs is
expensive and slow. Apple Silicon's unified memory makes it possible to run a capable
quantized model (e.g. mlx-community/Llama-3.2-3B-Instruct-4bit) at sub-100ms latency
with zero API cost. Developers can now:

  1. Optimize programs locally on their Mac using AppleLocalLM.
  2. Switch the configured LM to their production cloud model for the final deployment run.
  3. Use AppleFoundationLM in shipping macOS apps that need private, on-device inference.

Changes

New files

File Description
dspy/clients/apple_fm.py AppleFoundationLM adapter + shared response types (_FMResponse, _FMUsage, _FMChoice, _FMMessage)
dspy/clients/apple_local.py AppleLocalLM adapter (MLX backend, CoreML stub), _LocalStreamChunk
tests/clients/test_apple_fm.py 44 unit tests for AppleFoundationLM
tests/clients/test_apple_local.py 25 unit tests for AppleLocalLM (includes streaming)
tests/integration/test_apple_fm_integration.py 11 Mac-only live-SDK integration tests (skip cleanly on Linux)
docs/docs/api/models/AppleFoundationLM.md API reference
docs/docs/api/models/AppleLocalLM.md API reference
examples/apple_on_device_lm.py 6 selectable runnable demos

Modified files

File Change
dspy/clients/__init__.py Guarded try/except ImportError exports for both adapters
dspy/__init__.py dspy.AppleFoundationLM, dspy.AppleLocalLM top-level exports
docs/docs/learn/programming/language_models.md Apple Foundation + Apple Silicon tabs
pyproject.toml norecursedirs guard to prevent pytest from entering experiments/

Architecture

AppleFoundationLM — Apple Intelligence (macOS 26+)

lm = dspy.AppleFoundationLM()
dspy.configure(lm=lm)
result = dspy.Predict("question -> answer")(question="What is DSPy?")

Native structured outputs. When DSPy passes response_format=SomePydanticModel (its
standard structured-output path), this adapter intercepts it before it becomes a prompt
injection. _pydantic_to_generable() maps Pydantic field constraints to Apple's
@generable constrained decoding:

  • Literal["a", "b"] → fm.guide(anyOf=[...])
  • int = Field(ge=1, le=5) → fm.guide(range=(1, 5))
  • str = Field(pattern=r"\d+") → fm.guide(regex=...)

The model is then called with session.respond(generating=<generable_cls>), which guarantees
valid typed output at the token level — not a JSON parse of free text. The result is serialized
back to JSON so DSPy's output parser sees the same contract it would from the prompt path.

If _pydantic_to_generable() can't map a field (e.g. complex nested type), or if the Swift
grammar compiler rejects the schema at runtime, the adapter logs a warning, recreates a fresh
session, and retries without generating=, falling back gracefully to DSPy's standard
prompt-injection path.

Tool calling. Apple's SDK requires tools to be subclasses of fm.Tool, not plain
callables. _dspy_tool_to_apple_tool() dynamically subclasses fm.Tool at call time for
each DSPy tool, wiring call(**kwargs) to the DSPy callable. Generated subclasses are cached
by (tool_name, id(func)) so Apple's per-class SDK registration fires exactly once per
unique tool.

Async bridging. Apple's SDK is async-only. forward() bridges to sync via
asyncio.run() with nest_asyncio support for Jupyter notebooks.
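A minimal sketch of such a bridge. The `run_async` helper name is hypothetical, and the real adapter applies nest_asyncio in the already-running-loop case rather than raising:

```python
import asyncio


def run_async(coro):
    """Bridge an async-only SDK call to a sync forward(). Plain scripts take
    the asyncio.run() path; inside a running loop (e.g. Jupyter),
    asyncio.run() would raise, which is where nest_asyncio patching applies.
    This sketch raises a clear error there instead."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return asyncio.run(coro)  # no loop running: the common sync path
    raise RuntimeError(
        "run_async() called from a running event loop; "
        "install nest_asyncio or await the coroutine directly."
    )


async def fake_respond(prompt: str) -> str:
    # Stand-in for an async-only SDK call like session.respond().
    return f"echo: {prompt}"
```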

AppleLocalLM — MLX (Apple Silicon, macOS 14+)

lm = dspy.AppleLocalLM("mlx-community/Llama-3.2-3B-Instruct-4bit", bits=4)
dspy.configure(lm=lm)

Mixed-LM pipelines. The primary use case is cheap on-device preprocessing before
expensive cloud reasoning:

local_lm = dspy.AppleLocalLM("mlx-community/Llama-3.2-3B-Instruct-4bit")
cloud_lm = dspy.LM("anthropic/claude-sonnet-4-6")

class PreprocessAndReason(dspy.Module):
    def __init__(self):
        super().__init__()
        self.extract = dspy.Predict("raw_text -> entities, dates", lm=local_lm)
        self.reason  = dspy.Predict("entities, dates -> verdict", lm=cloud_lm)

Streaming. AppleLocalLM supports dspy.streamify() via DSPy's send_stream protocol.
Wrapping any program with dspy.streamify() causes forward() to call
mlx_lm.stream_generate() and push each _LocalStreamChunk token to the stream in real time.

Concurrency gate. DSPy optimizers issue many parallel aforward() calls. Unconstrained
concurrent MLX inference jobs would exhaust Apple Silicon's unified memory pool and OOM.
aforward() gates all calls through a lazily-initialized asyncio.Semaphore(max_concurrency)
(default: 1) before offloading to asyncio.to_thread(). Users with spare RAM can raise the
limit at construction time; the adapter warns if max_concurrency > 1 since MLX
thread-safety on a single model instance is undocumented.

Context window tracking. context_window is read from
tokenizer.model_max_length (with a 4096 fallback). A warning is logged when a prompt
would exceed the window rather than silently truncating.

Shared design decisions

Explicit caching. Both adapters bypass LiteLLM, so LiteLLM's automatic caching is
unavailable. dspy.cache.get/put is wired explicitly in each forward(). The cache key
covers {model, messages, temperature, max_tokens}; DSPy-internal keys (num_retries,
stream, n) are excluded to prevent spurious misses. Unknown kwargs are warned and cleared
so they cannot silently fragment the cache.
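A sketch of the key construction under these rules; the helper name and hashing scheme are assumptions, not the PR's actual implementation:

```python
import hashlib
import json

# DSPy-internal kwargs excluded from the key, per the design above.
DSPY_INTERNAL_KEYS = {"num_retries", "stream", "n"}


def cache_key(model: str, messages: list, **kwargs) -> str:
    """Deterministic key over {model, messages, temperature, max_tokens};
    internal kwargs never enter the payload, so varying them cannot
    cause spurious cache misses."""
    payload = {
        "model": model,
        "messages": messages,
        "temperature": kwargs.get("temperature"),
        "max_tokens": kwargs.get("max_tokens"),
    }
    blob = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()
```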

BaseLM response contract. _FMUsage implements __iter__ to yield (key, value)
pairs so dict(response.usage) works as expected by BaseLM._process_completion.
_FMResponse carries an explicit _hidden_params={"response_cost": 0.0} field — None
cost would cause sum([None, ...]) to raise TypeError in DSPy's history aggregator.

Explicit errors over silent degradation. stream=True raises NotImplementedError
(a streaming caller expects an async generator, not a string — use dspy.streamify()).
tools=[...] raises NotImplementedError in AppleLocalLM (mlx-lm has no native tool API)
with a pointer to AppleFoundationLM for users who need tools. Unknown backends raise
ValueError.


Testing

NOTE: The integration tests skip on non-macOS systems, so they will always be skipped in CI.

Unit tests (Linux/WSL — zero macOS dependencies)

Each test file defines _make_fake_fm_sdk() / _make_fake_mlx_lm() factories that inject
synthetic types.ModuleType instances into sys.modules via autouse fixtures. This lets
every logical path — message flattening, Pydantic→generable conversion, tool wrapping, cache
hit/miss, concurrency gating, kwarg warn/clear, context overflow warning, ARC session
fallback, streaming chunk emission — be exercised without any Apple hardware or SDK.

tests/clients/test_apple_fm.py    — 44 tests
tests/clients/test_apple_local.py — 25 tests

Integration tests (Mac only)

tests/integration/test_apple_fm_integration.py — 11 tests

These import the real apple_fm_sdk and skip cleanly on non-macOS platforms:

===== 11 skipped in 0.04s =====   ← Linux CI
===== 11 passed in  X.XXs =====   ← macOS 26+ with Apple Intelligence

Coverage includes: live round-trip generation, structured output via @generable, tool
invocation, cache round-trip against the real dspy.cache, and AppleLocalLM mlx-lm
generation.


Design Decisions

Non-obvious implementation choices made during development.


Why fm.Tool is subclassed dynamically at runtime

Apple's SDK requires tools to be registered as subclasses of fm.Tool — you can't pass a
callable or wrap a plain function. DSPy tools, on the other hand, are arbitrary Python objects
(callables, instances with .func, or dspy.Tool wrappers). There's no static base class to
subclass at module level because fm.Tool doesn't exist until import apple_fm_sdk runs,
which can only happen on macOS 26+.

_dspy_tool_to_apple_tool() uses type() at call time to dynamically create a fresh subclass
of fm.Tool for each DSPy tool, wiring call(**kwargs) to the DSPy callable. A top-level
class _WrappedTool(fm.Tool): ... would make the entire module unimportable on Linux. Dynamic
subclassing keeps the import guard clean: the class is only created inside aforward(), which
is only reached after __init__ has already validated the platform and imported the SDK.
Generated subclasses are cached by (tool_name, id(func)) so any per-class SDK-side
registration fires exactly once per unique tool.
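The pattern can be sketched with a stand-in base class, since `fm.Tool` exists only on macOS 26+; `FakeToolBase` and `wrap_tool` are illustrative names:

```python
_TOOL_CACHE: dict = {}


class FakeToolBase:
    """Stand-in for fm.Tool. __init_subclass__ models Apple's per-class
    SDK-side registration: it fires once each time a subclass is created."""
    registrations = 0

    def __init_subclass__(cls, **kw):
        super().__init_subclass__(**kw)
        FakeToolBase.registrations += 1


def wrap_tool(base: type, name: str, func) -> type:
    """Dynamically subclass `base` at call time via type(), wiring call()
    to the DSPy callable. Cached by (name, id(func)) so registration fires
    exactly once per unique tool."""
    key = (name, id(func))
    if key not in _TOOL_CACHE:
        _TOOL_CACHE[key] = type(name, (base,), {"call": staticmethod(func)})
    return _TOOL_CACHE[key]
```

Because the subclass is only created inside the wrapped call path, nothing at module import time ever touches the (possibly absent) SDK base class.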


How we mocked an entire OS-specific SDK to get unit tests passing on Linux

apple_fm_sdk doesn't exist on Linux. mlx_lm doesn't exist outside Apple Silicon. Both are
imported lazily inside methods. Each test file defines a factory that returns a
types.ModuleType populated with hand-rolled Python stand-ins, then injects it into
sys.modules via an autouse fixture before any import of the real package can occur.

Key constraints that shaped the fakes:

  1. guide() must return "", not MagicMock. _pydantic_to_generable() passes guide
    return values as dataclass field defaults, then calls dataclasses.asdict(). If a field
    holds a MagicMock, asdict() raises TypeError: Object of type MagicMock is not JSON serializable.

  2. generable() must be a passthrough decorator. If fm.generable(cls) returns a
    MagicMock, dataclasses.make_dataclass produces a class the fake session.respond() can't
    instantiate.

  3. LanguageModelSession must be async-context-safe. The fake respond() is an
    async def that returns a plain string or a dataclass instance depending on whether
    generating= was passed.

  4. Platform checks are patched at the platform module level. Patching
    platform.system globally in the fixture means even indirect callers see "Darwin".

  5. mlx_lm.sample_utils must be registered as a real submodule in sys.modules. A flat
    types.ModuleType with an attribute sample_utils is not the same as a registered submodule
    — Python's import system resolves from mlx_lm.sample_utils import make_sampler by looking
    up "mlx_lm.sample_utils" in sys.modules directly.
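Point 5 in particular can be demonstrated with a self-contained sketch; the stub attributes below are invented for illustration:

```python
import sys
import types


def install_fake_mlx_lm():
    """Build a synthetic mlx_lm with a properly registered submodule.
    The submodule must live in sys.modules under its dotted name:
    `from mlx_lm.sample_utils import make_sampler` resolves via
    sys.modules["mlx_lm.sample_utils"], so a mere attribute on the
    parent ModuleType is not enough."""
    fake = types.ModuleType("mlx_lm")
    sample_utils = types.ModuleType("mlx_lm.sample_utils")
    sample_utils.make_sampler = lambda temp=0.0: ("sampler", temp)  # stub
    fake.sample_utils = sample_utils
    fake.generate = lambda *a, **kw: "fake output"  # stub
    sys.modules["mlx_lm"] = fake
    sys.modules["mlx_lm.sample_utils"] = sample_utils
    return fake
```

In the actual test suite this runs inside an autouse fixture, before any code under test can attempt the real import.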


Why _FMUsage implements __iter__

BaseLM calls dict(response.usage) to record token counts. This works for LiteLLM objects
because their Usage class supports the mapping protocol. _FMUsage is a plain dataclass.
Adding __iter__ to yield (key, value) pairs makes dict() work without converting
_FMUsage to a dict subclass. Cheapest fix that satisfies the contract.
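A minimal stand-in showing the trick (`Usage` here is illustrative, not the PR's `_FMUsage`):

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Plain dataclass whose __iter__ yields (key, value) pairs, so
    dict(usage) works without subclassing dict or implementing the
    full mapping protocol."""
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0

    def __iter__(self):
        yield "prompt_tokens", self.prompt_tokens
        yield "completion_tokens", self.completion_tokens
        yield "total_tokens", self.total_tokens
```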


Why caching lives in forward(), not BaseLM.__call__

dspy.LM gets automatic caching via LiteLLM's response cache. BaseLM subclasses that
bypass LiteLLM get no caching — BaseLM.__call__ does not cache. Both Apple adapters wire
dspy.cache.get() / put() explicitly in forward(). The cache key covers
{model, messages, temperature, max_tokens} with DSPy-internal keys (num_retries, stream,
n) excluded to prevent spurious misses.


Why response_format is intercepted in aforward(), not at the DSPy adapter layer

DSPy's ChatAdapter injects a JSON schema into the prompt for structured output requests, then
parses the text response back. Apple's SDK offers something better:
session.respond(generating=SomeGenerableClass) triggers native constrained decoding — the
model is guaranteed to emit valid tokens for that schema.

Intercepting response_format in aforward() (before it becomes a prompt injection) lets us
route Pydantic models through the native path. The result is serialized back to a JSON string
so DSPy's output parser sees exactly what it would have seen from the prompt-based path — same
contract, better reliability on small on-device models.


Per-call LanguageModelSession (stateless pattern)

Apple's LanguageModelSession is designed to maintain conversational state across turns. DSPy
modules are stateless — each forward() call is independent, and DSPy manages its own prompt
construction (including injecting few-shot examples). Reusing a session across calls would
accumulate spurious conversation history and produce wrong outputs. A new session is created on
every aforward() call; the overhead is acceptable for on-device inference (no network
round-trip).


Why aforward() uses a lazy asyncio.Semaphore, not a threading lock

DSPy optimizers evaluate candidate prompts in parallel via many concurrent aforward() calls.
Without a gate, 20 concurrent optimizer candidates submit 20 simultaneous MLX inference jobs,
each allocating activation memory in Apple Silicon's unified memory pool — instant OOM.

An asyncio.Semaphore in aforward() is the natural gate: callers suspend cooperatively
before submitting to the thread pool, so only max_concurrency blocking jobs run at a time. A
threading.Semaphore inside the sync path would also work but would block a thread-pool thread
while waiting, wasting thread resources.

The semaphore is initialized lazily on first aforward() call because asyncio.Semaphore must
be created in the event loop that will use it — __init__ often runs outside any event loop.
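A sketch of the lazy-initialization pattern; the `Gate` class is illustrative, and an `asyncio.sleep` stands in for the `asyncio.to_thread` MLX call:

```python
import asyncio


class Gate:
    """Create the semaphore on first aforward() so it binds to the event
    loop that will actually use it; __init__ may run outside any loop."""

    def __init__(self, max_concurrency: int = 1):
        self.max_concurrency = max_concurrency
        self._sem = None  # deferred: no event loop exists yet
        self._active = 0
        self.peak = 0  # instrumentation for the test below

    async def aforward(self, delay: float = 0.01) -> str:
        if self._sem is None:
            self._sem = asyncio.Semaphore(self.max_concurrency)
        async with self._sem:  # suspend cooperatively before the blocking job
            self._active += 1
            self.peak = max(self.peak, self._active)
            await asyncio.sleep(delay)  # stands in for asyncio.to_thread(...)
            self._active -= 1
            return "ok"
```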


Why tools raises NotImplementedError in AppleLocalLM rather than being silently dropped

mlx-lm has no native tool-calling API. Silently dropping tools=[...] would let DSPy
programs appear to run successfully while actually skipping all tool invocations, producing
wrong outputs with no diagnostic. The error message points directly to AppleFoundationLM,
which has full native fm.Tool support.


Why token counts are computed from the tokenizer, and why response_cost = 0.0

Apple's Foundation Model SDK exposes no tokenizer. mlx-lm loads a HuggingFace tokenizer as
part of mlx_lm.load(). Token counts are computed by encoding the flat prompt and the
generated text with tokenizer.encode() after inference. Accurate counts matter for DSPy's
optimization budget tracking — BaseLM stores dict(response.usage) in history and optimizer
callbacks read prompt_tokens / completion_tokens to estimate cost.

response_cost = 0.0 rather than None: on-device inference has no monetary cost, but DSPy's
history aggregator sums entry["cost"] across all calls. sum([None, ...]) raises TypeError.
Setting 0.0 explicitly makes the sum safe while accurately representing zero cost.
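A toy sketch of both points, with a whitespace tokenizer standing in for the real HuggingFace tokenizer that `mlx_lm.load()` returns:

```python
class ToyTokenizer:
    """Whitespace tokenizer used only for illustration; real counts come
    from tokenizer.encode() on the loaded HuggingFace tokenizer."""

    def encode(self, text: str) -> list:
        return [hash(t) for t in text.split()]


def compute_usage(tokenizer, prompt: str, completion: str) -> dict:
    """Encode prompt and completion after inference and count tokens."""
    p = len(tokenizer.encode(prompt))
    c = len(tokenizer.encode(completion))
    return {"prompt_tokens": p, "completion_tokens": c, "total_tokens": p + c}


def total_cost(history: list) -> float:
    """Aggregate cost across history entries. Entries must carry 0.0,
    never None: sum() over a list containing None raises TypeError."""
    return sum(entry["cost"] for entry in history)
```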


Why unknown kwargs are warned-and-cleared instead of forwarded

Unknown kwargs (e.g. top_p=0.9) would change the cache key without changing the model output
— every unique top_p value creates a new cold cache entry for what is functionally the same
generation. Clearing them after warning prevents silent cache fragmentation and surfaces the
mismatch to users who set global dspy.configure options expecting them to apply.
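A sketch of the warn-and-clear step; the supported-kwarg set and helper name are assumptions for illustration:

```python
import logging

logger = logging.getLogger("apple_adapter_sketch")

# Hypothetical supported set; the real adapters define their own.
KNOWN_KWARGS = {"temperature", "max_tokens"}


def clean_kwargs(kwargs: dict) -> dict:
    """Warn about and drop unknown kwargs so they never reach the cache
    key: e.g. top_p=0.9 would otherwise mint a cold cache entry per value."""
    unknown = set(kwargs) - KNOWN_KWARGS
    for name in sorted(unknown):
        logger.warning("Ignoring unsupported kwarg %r", name)
    return {k: v for k, v in kwargs.items() if k in KNOWN_KWARGS}
```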


Streaming strategy for AppleLocalLM

Streaming is supported via dspy.streamify() using DSPy's dspy.settings.send_stream
protocol, not via a stream=True kwarg. forward(stream=True) raises
NotImplementedError with a message directing users to streamify().

Two code paths:

  1. Primary path — forward() in anyio worker thread (via asyncify):
    streamify() wraps Predict.__call__ with asyncify, which runs it in an anyio-managed
    worker thread. Predict.forward() calls lm.forward() from that thread. When
    dspy.settings.send_stream is set, forward() calls mlx_lm.stream_generate()
    synchronously and pushes each _LocalStreamChunk to the anyio MemoryObjectSendStream via
    anyio.from_thread.run(send_stream.send, chunk).

  2. Secondary path — aforward() for direct async callers:
    When await lm.aforward() is called directly (bypassing Predict),
    _stream_generate_async() bridges mlx_lm.stream_generate() (sync) to an async generator
    via asyncio.Queue + loop.call_soon_threadsafe().
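The secondary path's sync-to-async bridge can be sketched as follows, with a plain thread standing in for wherever the adapter runs the blocking generator; names are illustrative:

```python
import asyncio
import threading


async def bridge(sync_gen_factory):
    """Bridge a blocking token generator to an async generator: a worker
    thread pushes each token into an asyncio.Queue via
    loop.call_soon_threadsafe, the loop-safe way to touch the queue from
    another thread."""
    loop = asyncio.get_running_loop()
    queue = asyncio.Queue()
    DONE = object()  # sentinel marking end of stream

    def worker():
        for token in sync_gen_factory():  # stands in for mlx_lm.stream_generate()
            loop.call_soon_threadsafe(queue.put_nowait, token)
        loop.call_soon_threadsafe(queue.put_nowait, DONE)

    threading.Thread(target=worker, daemon=True).start()
    while (item := await queue.get()) is not DONE:
        yield item
```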

mlx_lm.stream_generate() is used rather than the lower-level generate_step() because it
is the public high-level API that handles EOS detection, max-token limits, and token decoding
internally — avoiding fragile reimplementation of per-token control logic.

_LocalStreamChunk(text, model, predict_id) is a custom dataclass, not a litellm
ModelResponseStream. DSPy's streamify() passes custom chunk types through its wildcard
branch to the caller; StreamListener field-extraction is unavailable and all tokens stream
raw.


Why session.respond(generating=...) is wrapped in try/except with session recreation

_pydantic_to_generable() can return a valid @generable class, but the underlying Swift
grammar engine can still reject the schema at inference time (e.g. a Union[str, List[str]]
field might compile without error yet fail when Apple's constrained-decoding compiler tries to
build the grammar automaton). On failure:

  1. Log a WARNING (so integration-test logs reveal schema issues).
  2. del session — the failed session may have advanced internal state.
  3. Recreate a fresh LanguageModelSession from the already-built session_kwargs.
  4. Retry with await session.respond(prompt=flat_prompt) (no generating=).

except Exception rather than TypeError / ValueError specifically: the Swift bridge
surfaces errors as Python Exception subclasses whose exact types depend on the SDK version
and are undocumented. The intent is unconditional fallback.


Why max_concurrency > 1 emits a warning instead of being hard-capped at 1

MLX's Python bindings call into a C++/Metal backend. It is undocumented whether a single
mlx.nn.Module instance supports concurrent generate() calls from multiple Python threads.
If it does not, max_concurrency > 1 can cause Metal command queue deadlocks or segfaults
with no Python traceback.

Warn rather than cap: hard-capping would deny the benefit to users who test and confirm
thread-safety on their specific hardware + MLX version, or who load separate model instances
per thread. The default is max_concurrency=1 (always safe); users who want higher throughput
opt in explicitly.


Notes for reviewers

macOS-specific setup (Apple paths only)

Use these steps only when validating AppleFoundationLM / AppleLocalLM behavior.

  1. Use Apple Silicon hardware and macOS 26+.

  2. Enable Apple Intelligence in system settings (required for AppleFoundationLM).

  3. Install DSPy development dependencies.

  4. Install mlx-lm for AppleLocalLM paths:

    pip install mlx-lm
  5. Install apple_fm_sdk from Apple's distribution channel.
    It is not currently published on PyPI, so follow Apple's installation instructions for your SDK access level.

  6. Verify imports:

    python -c "import dspy; from dspy.clients import AppleFoundationLM, AppleLocalLM; print('ok')"
  7. Verify Apple-path tests:

    pytest -vv tests/clients/test_apple_fm.py tests/clients/test_apple_local.py tests/integration/test_apple_fm_integration.py

Expected behavior on non-macOS or without apple_fm_sdk: integration tests skip and import dspy remains successful.

  • AppleLocalLM(backend="coreml") raises NotImplementedError with an invitation to
    contribute. The CoreML path is stubbed (not deleted) so the backend= parameter is
    part of the public API from day one.
  • The LanguageModelSession is intentionally created per-call (stateless pattern). DSPy
    manages all prompt construction including few-shot injection; reusing a session across calls
    would accumulate spurious conversational history.

zombat and others added 12 commits March 19, 2026 15:51
…ackends

Adds two new BaseLM subclasses for on-device inference on Apple Silicon
with zero cloud dependency:

- dspy.AppleFoundationLM — wraps Apple's LanguageModelSession (macOS 26+,
  Apple Intelligence) with native constrained decoding for structured outputs
  via @generable and full fm.Tool support for tool calling.
- dspy.AppleLocalLM — wraps mlx-lm to run any HuggingFace model on Apple
  Silicon (M1/M2/M3/M4). Supports the full DSPy optimizer workflow including
  BootstrapFewShot and MIPROv2.

Both adapters are zero-regression on Linux CI: all platform-gated imports are
lazily loaded inside try/except guards, and __all__ is extended conditionally
so from dspy.clients import * never raises AttributeError on non-macOS.

Verified on macOS 26.3.1 with Apple Intelligence:
- 64 unit tests pass (mocked SDKs, Linux-compatible)
- 11 integration tests pass against real on-device model
- Native @generable constrained decoding, tool calling, and async confirmed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
mlx_lm.generate() removed the direct temperature= kwarg; temperature
is now configured via make_sampler(temp=...) from mlx_lm.sample_utils
passed as sampler=.  Update _generate() to use the new API and update
the unit-test fake module to expose a sample_utils submodule with a
stub make_sampler so all 28 tests continue to pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a 'cot' demo to apple_on_device_lm.py that loads OpenAI's
gpt-oss-20b via InferenceIllusionist/gpt-oss-20b-MLX-4bit and runs
dspy.ChainOfThought on the bat-and-ball CRT question.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements DSPy's streamify() protocol for AppleLocalLM:

- Add _LocalStreamChunk dataclass (text, model, predict_id fields) — a
  custom chunk type that passes through streamify()'s wildcard branch
- In forward(): detect dspy.settings.send_stream and switch from
  mlx_lm.generate() to mlx_lm.stream_generate(), forwarding each token
  to the anyio MemoryObjectSendStream via anyio.from_thread.run().
  This handles the primary path: streamify() -> asyncify -> anyio thread
  -> Predict.forward() -> lm.forward()
- In aforward(): detect send_stream and use an asyncio.Queue bridge
  (_stream_generate_async) for direct async callers
- Add 5 unit tests covering chunk delivery, text concatenation, and the
  forward(stream=True) guard message

Note: _LocalStreamChunk is not a litellm ModelResponseStream, so
StreamListener field-extraction is unavailable; all tokens stream raw.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds 'streaming' entry to apple_on_device_lm.py showing dspy.streamify()
with AppleLocalLM — prints tokens as they arrive then shows the parsed
Prediction at the end.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Prevents conftest.py module name collision between
experiments/DSPy-AppleFM/tests/ and tests/.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…eanup

- All classes, methods, and functions now have docstrings with Google-style
  Args/Returns/Raises/Yields/Attributes sections
- Replace bare `import typing` with explicit `from typing import Literal`
  (used in _pydantic_to_generable) alongside existing imports
- Add return-type annotation to _FMUsage.__iter__ (Iterator[tuple[str, int]])
- Fix _FMMessage.tool_calls annotation: list -> list[Any]
- Add _hidden_params dict generic: dict -> dict[str, Any]
- Add _WrappedTool.call() docstring
- Add inline comments on non-obvious expressions (id(), call_soon_threadsafe,
  lazy semaphore init, sentinel value, ARC del session)
- No logic changes; 69/69 unit tests pass

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace stale "why stream=True raises NotImplementedError" rationale with
the actual implemented streaming design: two code paths (primary forward()
via anyio.from_thread.run in worker thread, secondary aforward() via
asyncio.Queue bridge), stream_generate() vs generate_step() decision, and
_LocalStreamChunk chunk type semantics.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move all design-decision reasoning from TODO.md into PR-DRAFT.md under a
new "Design Decisions" section. Remove checklist items and verification
tables — the implementation is complete. TODO.md reduced to a one-line
status pointer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 19, 2026 22:21
Author

zombat commented Mar 19, 2026

I was multitasking; both OS versions should have been 26+...

Contributor

Copilot AI left a comment


Pull request overview

This PR introduces native Apple on-device language model backends for DSPy via two new BaseLM adapters: one for Apple’s Foundation Models SDK (Apple Intelligence) and one for local MLX (mlx-lm) inference on Apple Silicon, along with tests, docs, and an example script.

Changes:

  • Added AppleFoundationLM (Apple Intelligence / apple_fm_sdk) and AppleLocalLM (MLX / mlx-lm) adapters with caching, tool support (Foundation only), and streaming support (Local via streamify).
  • Added extensive unit tests using mocked SDK/modules and a macOS-only integration test suite for Foundation Models.
  • Updated documentation and examples to surface the new models, plus small repo config tweaks (pytest norecursedirs, gitignore).

Reviewed changes

Copilot reviewed 13 out of 15 changed files in this pull request and generated 10 comments.

File Description
dspy/clients/apple_fm.py Adds Apple Foundation Models adapter and shared OpenAI-like response types/helpers.
dspy/clients/apple_local.py Adds MLX-backed local adapter with streaming and concurrency gating.
dspy/clients/__init__.py Adds guarded exports for Apple adapters; adjusts cache configuration wiring.
tests/clients/test_apple_fm.py Unit tests for AppleFoundationLM using a mocked apple_fm_sdk.
tests/clients/test_apple_local.py Unit tests for AppleLocalLM using a mocked mlx_lm (incl. streaming).
tests/integration/test_apple_fm_integration.py Mac-only integration tests against the real Apple Foundation Models SDK.
tests/integration/__init__.py Package marker for integration tests.
examples/apple_on_device_lm.py Runnable demos for Foundation, MLX local, mixed pipelines, and streaming.
docs/docs/learn/programming/language_models.md Adds Apple Foundation + Apple Silicon (MLX) usage tabs.
docs/docs/api/models/AppleFoundationLM.md API reference stub for dspy.AppleFoundationLM.
docs/docs/api/models/AppleLocalLM.md API reference stub for dspy.AppleLocalLM.
docs/mkdocs.yml Adds Apple model docs pages to navigation.
pyproject.toml Adds pytest norecursedirs to avoid recursing into certain directories.
README.md Adds macOS extras section for Apple on-device models.
.gitignore Ignores venv/ and .venv/.


zombat and others added 4 commits March 19, 2026 18:47
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Raymond Rizzo <raymond.rizzo@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Raymond Rizzo <raymond.rizzo@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Raymond Rizzo <raymond.rizzo@gmail.com>
…est guard

- Replace string concatenation with list+join in both streaming paths of
  AppleLocalLM.forward() and aforward() — avoids O(n²) for long outputs
- Use asyncio.get_running_loop() instead of deprecated get_event_loop() in
  _stream_generate_async()
- Silently discard temperature/max_tokens in AppleFoundationLM.aforward()
  before the unknown-kwargs warning — they're valid DSPy params included in
  the cache key but unsupported by Apple's SDK
- Wrap AppleFoundationLM init in integration test fixture with try/except
  RuntimeError so tests skip cleanly on Macs without Apple Intelligence

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Author

@zombat zombat left a comment


Updates as per Copilot's suggestions.

zombat added 2 commits March 20, 2026 11:28
When response_format is a Pydantic model, AppleLocalLM now builds an
outlines FSM logits processor and passes it to mlx_lm.generate() /
stream_generate(), guaranteeing the model output matches the schema.
Processor is cached per schema to amortise FSM compilation cost.

Falls back to prompt-only mode with a warning when outlines is not
installed (pip install 'outlines[mlxlm]'). Cache key includes a
response_schema field when a schema is active to prevent
cross-contamination with unconstrained calls.
…Error

Apple's on-device content filter raises GuardrailViolationError when it
rejects a prompt. Previously this either bubbled up as an opaque SDK
exception or got swallowed by the generable-fallback retry logic (which
would hit the same guardrail again anyway).

Now both aforward() call sites catch it by class name and immediately
re-raise a RuntimeError with a clear message directing the user to
rephrase their input. Three tests cover the plain-text path, the
@generable path, and passthrough of unrelated exceptions.
@zombat zombat changed the title Feature/apple foundation models feat(apple foundation models) Mar 20, 2026
zombat added 3 commits March 22, 2026 17:31
- The 4 response dataclasses verbatim (preserving all docs/comments/style)
- _flatten_messages() verbatim
- _AppleBaseLM(BaseLM) abstract base with a single concrete method: a unified _build_response(text, usage=None) that both adapters can inherit

apple_fm.py edits:

- Remove the 4 dataclasses + _flatten_messages (~120 lines)
- Import from apple_base instead
- AppleFoundationLM(BaseLM) → AppleFoundationLM(_AppleBaseLM)
- Drop its _build_response override (now inherited)
- Result: 530 lines

apple_local.py edits:

- Swap from dspy.clients.apple_fm import ... → from dspy.clients.apple_base import ...
- Remove the inline from dspy.clients.apple_fm import _flatten_messages inside _apply_chat_template
- AppleLocalLM(BaseLM) → AppleLocalLM(_AppleBaseLM)
- Drop its _build_response override (now inherited)
- Switch _LocalStreamChunk from the @dataclass shorthand to @dataclasses.dataclass to match file conventions
- Result: 710 lines
…ter + MLX mixin

Introduce apple_base.py as the single home for the OpenAI-compatible
response types (_FMMessage, _FMChoice, _FMUsage, _FMResponse),
_flatten_messages, _run_async, and _AppleBaseLM — eliminating the
awkward private cross-import from apple_local → apple_fm.

Extract all MLX inference internals (_MLXMixin, _LocalStreamChunk,
_apply_chat_template, _response_format_to_schema) into apple_local_mlx.py,
reducing apple_local.py to a thin public adapter. Deduplicate
_build_response (now on _AppleBaseLM), _raise_for_guardrail (static
method on _AppleBaseLM, replaces two identical inline blocks in
aforward), and token-counting/_FMUsage construction (_compute_usage
on AppleLocalLM, used by both forward and aforward).

Line counts before → after:
  apple_fm.py        652 → 467
  apple_local.py     742 → 473
  apple_base.py        — → 241  (new)
  apple_local_mlx.py   — → 300  (new)

Also: remove unused imports flagged by ruff, convert dict() calls to
literals (C408), re-export _FM* types from apple_fm for import
compatibility.

No behaviour changes; 80/80 tests pass.
Author

zombat commented Mar 22, 2026

the lazy import approach with try/except in init.py is the right pattern for platform-gated dependencies. a few things: (1) configure_cache at line 37 shadows the module-level DSPY_CACHE with a local dspy_cache — that rename is unrelated to the Apple feature and should probably be a separate commit. (2) the apple_fm.py and apple_local.py files are 650+ lines each. might be worth splitting the response dataclasses (_FMMessage, _FMChoice, etc) into a shared module since both adapters need the same OpenAI-compatible response shape. (3) test coverage is strong (69 tests) but all integration tests skip on non-macOS — worth noting in the PR that CI will never run them.

Great catches. It honestly started to get away from me a bit as it came together, but it's been broken up now. 1, and 2 done. 3 ongoing...

zombat added 2 commits March 22, 2026 19:25
…call fields

litellm 1.82.4 (Pydantic V2) raises AttributeError when ModelResponse.usage
is not set; fall back to {} instead of crashing. Also strip None-valued fields
from Responses API tool call dicts — 1.82.4 adds namespace=None to
ResponseFunctionToolCall, breaking dict equality assertions.
Author

zombat commented Mar 23, 2026

#3 complete
