Conversation
…ackends

Adds two new BaseLM subclasses for on-device inference on Apple Silicon with zero cloud dependency:

- dspy.AppleFoundationLM — wraps Apple's LanguageModelSession (macOS 26+, Apple Intelligence) with native constrained decoding for structured outputs via @generable and full fm.Tool support for tool calling.
- dspy.AppleLocalLM — wraps mlx-lm to run any HuggingFace model on Apple Silicon (M1/M2/M3/M4). Supports the full DSPy optimizer workflow including BootstrapFewShot and MIPROv2.

Both adapters are zero-regression on Linux CI: all platform-gated imports are lazily loaded inside try/except guards, and __all__ is extended conditionally so `from dspy.clients import *` never raises AttributeError on non-macOS.

Verified on macOS 26.3.1 with Apple Intelligence:

- 64 unit tests pass (mocked SDKs, Linux-compatible)
- 11 integration tests pass against real on-device model
- Native @generable constrained decoding, tool calling, and async confirmed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
mlx_lm.generate() removed the direct temperature= kwarg; temperature is now configured via make_sampler(temp=...) from mlx_lm.sample_utils passed as sampler=. Update _generate() to use the new API and update the unit-test fake module to expose a sample_utils submodule with a stub make_sampler so all 28 tests continue to pass. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a 'cot' demo to apple_on_device_lm.py that loads OpenAI's gpt-oss-20b via InferenceIllusionist/gpt-oss-20b-MLX-4bit and runs dspy.ChainOfThought on the bat-and-ball CRT question. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements DSPy's streamify() protocol for AppleLocalLM: - Add _LocalStreamChunk dataclass (text, model, predict_id fields) — a custom chunk type that passes through streamify()'s wildcard branch - In forward(): detect dspy.settings.send_stream and switch from mlx_lm.generate() to mlx_lm.stream_generate(), forwarding each token to the anyio MemoryObjectSendStream via anyio.from_thread.run(). This handles the primary path: streamify() -> asyncify -> anyio thread -> Predict.forward() -> lm.forward() - In aforward(): detect send_stream and use an asyncio.Queue bridge (_stream_generate_async) for direct async callers - Add 5 unit tests covering chunk delivery, text concatenation, and the forward(stream=True) guard message Note: _LocalStreamChunk is not a litellm ModelResponseStream, so StreamListener field-extraction is unavailable; all tokens stream raw. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds 'streaming' entry to apple_on_device_lm.py showing dspy.streamify() with AppleLocalLM — prints tokens as they arrive then shows the parsed Prediction at the end. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Prevents conftest.py module name collision between experiments/DSPy-AppleFM/tests/ and tests/. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…eanup - All classes, methods, and functions now have docstrings with Google-style Args/Returns/Raises/Yields/Attributes sections - Replace bare `import typing` with explicit `from typing import Literal` (used in _pydantic_to_generable) alongside existing imports - Add return-type annotation to _FMUsage.__iter__ (Iterator[tuple[str, int]]) - Fix _FMMessage.tool_calls annotation: list -> list[Any] - Add _hidden_params dict generic: dict -> dict[str, Any] - Add _WrappedTool.call() docstring - Add inline comments on non-obvious expressions (id(), call_soon_threadsafe, lazy semaphore init, sentinel value, ARC del session) - No logic changes; 69/69 unit tests pass Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace stale "why stream=True raises NotImplementedError" rationale with the actual implemented streaming design: two code paths (primary forward() via anyio.from_thread.run in worker thread, secondary aforward() via asyncio.Queue bridge), stream_generate() vs generate_step() decision, and _LocalStreamChunk chunk type semantics. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move all design-decision reasoning from TODO.md into PR-DRAFT.md under a new "Design Decisions" section. Remove checklist items and verification tables — the implementation is complete. TODO.md reduced to a one-line status pointer. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
I was multitasking; both OS versions should have been 26+...
Pull request overview
This PR introduces native Apple on-device language model backends for DSPy via two new BaseLM adapters: one for Apple’s Foundation Models SDK (Apple Intelligence) and one for local MLX (mlx-lm) inference on Apple Silicon, along with tests, docs, and an example script.
Changes:
- Added `AppleFoundationLM` (Apple Intelligence / `apple_fm_sdk`) and `AppleLocalLM` (MLX / `mlx-lm`) adapters with caching, tool support (Foundation only), and streaming support (Local via `streamify`).
- Added extensive unit tests using mocked SDK/modules and a macOS-only integration test suite for Foundation Models.
- Updated documentation and examples to surface the new models, plus small repo config tweaks (pytest `norecursedirs`, gitignore).
Reviewed changes
Copilot reviewed 13 out of 15 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| `dspy/clients/apple_fm.py` | Adds Apple Foundation Models adapter and shared OpenAI-like response types/helpers. |
| `dspy/clients/apple_local.py` | Adds MLX-backed local adapter with streaming and concurrency gating. |
| `dspy/clients/__init__.py` | Adds guarded exports for Apple adapters; adjusts cache configuration wiring. |
| `tests/clients/test_apple_fm.py` | Unit tests for `AppleFoundationLM` using a mocked `apple_fm_sdk`. |
| `tests/clients/test_apple_local.py` | Unit tests for `AppleLocalLM` using a mocked `mlx_lm` (incl. streaming). |
| `tests/integration/test_apple_fm_integration.py` | Mac-only integration tests against the real Apple Foundation Models SDK. |
| `tests/integration/__init__.py` | Package marker for integration tests. |
| `examples/apple_on_device_lm.py` | Runnable demos for Foundation, MLX local, mixed pipelines, and streaming. |
| `docs/docs/learn/programming/language_models.md` | Adds Apple Foundation + Apple Silicon (MLX) usage tabs. |
| `docs/docs/api/models/AppleFoundationLM.md` | API reference stub for `dspy.AppleFoundationLM`. |
| `docs/docs/api/models/AppleLocalLM.md` | API reference stub for `dspy.AppleLocalLM`. |
| `docs/mkdocs.yml` | Adds Apple model docs pages to navigation. |
| `pyproject.toml` | Adds pytest `norecursedirs` to avoid recursing into certain directories. |
| `README.md` | Adds macOS extras section for Apple on-device models. |
| `.gitignore` | Ignores `venv/` and `.venv/`. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Raymond Rizzo <raymond.rizzo@gmail.com>
…est guard

- Replace string concatenation with list+join in both streaming paths of AppleLocalLM.forward() and aforward() — avoids O(n²) for long outputs
- Use asyncio.get_running_loop() instead of deprecated get_event_loop() in _stream_generate_async()
- Silently discard temperature/max_tokens in AppleFoundationLM.aforward() before the unknown-kwargs warning — they're valid DSPy params included in the cache key but unsupported by Apple's SDK
- Wrap AppleFoundationLM init in integration test fixture with try/except RuntimeError so tests skip cleanly on Macs without Apple Intelligence

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
zombat left a comment
Updates as per Copilot suggestions.
When response_format is a Pydantic model, AppleLocalLM now builds an outlines FSM logits processor and passes it to mlx_lm.generate() / stream_generate(), guaranteeing the model output matches the schema. Processor is cached per schema to amortise FSM compilation cost. Falls back to prompt-only mode with a warning when outlines is not installed (pip install 'outlines[mlxlm]'). Cache key includes a response_schema field when a schema is active to prevent cross-contamination with unconstrained calls.
…Error Apple's on-device content filter raises GuardrailViolationError when it rejects a prompt. Previously this either bubbled up as an opaque SDK exception or got swallowed by the generable-fallback retry logic (which would hit the same guardrail again anyway). Now both aforward() call sites catch it by class name and immediately re-raise a RuntimeError with a clear message directing the user to rephrase their input. Three tests cover the plain-text path, the @generable path, and passthrough of unrelated exceptions.
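The catch-by-class-name pattern described above can be sketched as follows. This is a hypothetical stand-in: `GuardrailViolationError` here is a local class standing in for Apple's SDK exception, and `respond`/`safe_respond` are illustrative helpers, not the actual adapter methods.

```python
# Sketch: re-raise an SDK exception matched by class *name* rather than by type,
# useful when the real exception class only exists on macOS and can't be imported.
class GuardrailViolationError(Exception):  # stand-in for Apple's SDK exception
    pass

def respond(prompt):
    # Pretend the on-device content filter rejected the prompt.
    raise GuardrailViolationError("content filter rejected prompt")

def safe_respond(prompt):
    try:
        return respond(prompt)
    except Exception as e:
        # Match by name: the real class can't be imported on non-macOS.
        if type(e).__name__ == "GuardrailViolationError":
            raise RuntimeError(
                "Apple Intelligence guardrail rejected this input; "
                "please rephrase your prompt."
            ) from e
        raise  # unrelated exceptions pass through unchanged
```

Matching on `type(e).__name__` keeps the handler importable on Linux while still giving macOS users a clear, actionable error.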
Moved into apple_base.py:

- The 4 response dataclasses, verbatim (preserving all docs/comments/style)
- _flatten_messages() verbatim
- _AppleBaseLM(BaseLM) abstract base with a single concrete method: a unified _build_response(text, usage=None) that both adapters can inherit

apple_fm.py edits:

- Remove the 4 dataclasses + _flatten_messages (~120 lines); import from apple_base instead
- AppleFoundationLM(BaseLM) → AppleFoundationLM(_AppleBaseLM)
- Drop its _build_response override (now inherited)
- Result: 530 lines

apple_local.py edits:

- Swap `from dspy.clients.apple_fm import ...` → `from dspy.clients.apple_base import ...`
- Remove inline `from dspy.clients.apple_fm import _flatten_messages` inside _apply_chat_template
- AppleLocalLM(BaseLM) → AppleLocalLM(_AppleBaseLM)
- Drop its _build_response override (now inherited)
- Switch _LocalStreamChunk from @dataclass shorthand to @dataclasses.dataclass to match file conventions
- Result: 710 lines
…ter + MLX mixin

Introduce apple_base.py as the single home for the OpenAI-compatible response types (_FMMessage, _FMChoice, _FMUsage, _FMResponse), _flatten_messages, _run_async, and _AppleBaseLM — eliminating the awkward private cross-import from apple_local → apple_fm. Extract all MLX inference internals (_MLXMixin, _LocalStreamChunk, _apply_chat_template, _response_format_to_schema) into apple_local_mlx.py, reducing apple_local.py to a thin public adapter.

Deduplicate _build_response (now on _AppleBaseLM), _raise_for_guardrail (static method on _AppleBaseLM, replaces two identical inline blocks in aforward), and token-counting/_FMUsage construction (_compute_usage on AppleLocalLM, used by both forward and aforward).

Line counts before → after:

- apple_fm.py 652 → 467
- apple_local.py 742 → 473
- apple_base.py — → 241 (new)
- apple_local_mlx.py — → 300 (new)

Also: remove unused imports flagged by ruff, convert dict() calls to literals (C408), re-export _FM* types from apple_fm for import compatibility. No behaviour changes; 80/80 tests pass.
Great catches. It honestly started to get away from me a bit as it came together, but it's been broken up now. 1 and 2 are done; 3 is ongoing...
…call fields
litellm 1.82.4 (Pydantic V2) raises AttributeError when ModelResponse.usage
is not set; fall back to {} instead of crashing. Also strip None-valued fields
from Responses API tool call dicts — 1.82.4 adds namespace=None to
ResponseFunctionToolCall, breaking dict equality assertions.
#3 complete
feat: Add `AppleFoundationLM` and `AppleLocalLM` — native Apple Silicon & Apple Intelligence backends

Summary

This PR adds two new `BaseLM` subclasses that allow DSPy programs to run entirely on-device on Apple Silicon, with no cloud dependency:

- `dspy.AppleFoundationLM` — wraps Apple's `LanguageModelSession` (macOS 26+, Apple Intelligence) with native constrained decoding for structured outputs and full `fm.Tool` support for tool calling.
- `dspy.AppleLocalLM` — wraps `mlx-lm` to run any HuggingFace model natively on Apple Silicon. Supports the full DSPy optimizer workflow including `BootstrapFewShot` and `MIPROv2`.

Both adapters are zero-regression on Linux CI: all platform-gated imports are lazily loaded and wrapped in `try/except` guards. The entire test suite (69 unit tests) passes on Linux/WSL with no macOS dependencies.
Reviewer Note: Feature-Pass Evidence vs Pre-Existing Failures

Feature-pass evidence (this PR path):

- `pre-commit run --all-files` passes on this branch.
- `tests/clients/test_apple_fm.py`
- `tests/clients/test_apple_local.py`
- `tests/integration/test_apple_fm_integration.py`

Pre-existing failures (not introduced by this PR):

- The full run (`pytest -vv tests/`) reports failures outside Apple-path files.
- Those failures (`AttributeError: 'ModelResponse' object has no attribute 'usage'`) appear in non-Apple tests.

Apple-Only Reviewer Checklist

- `import dspy` succeeds when Apple SDKs are absent.
- `dspy.AppleFoundationLM` init guards: clear errors for non-macOS / missing SDK / unavailable model.
- `dspy.AppleLocalLM` init guards: clear errors for non-macOS, Intel Mac, missing `mlx-lm`.
- `AppleFoundationLM`: `response_format` maps to native constrained decoding when supported.
- `AppleFoundationLM` executes DSPy tools via `fm.Tool` wrapping.
- `AppleLocalLM` rejects `tools=[...]` with explicit `NotImplementedError`.
- `AppleLocalLM` supports `dspy.streamify()` token streaming.
- `stream=True` direct call behavior remains explicit and documented.
- `AppleLocalLM` (`max_concurrency`) prevents uncontrolled parallel MLX calls.
- `examples/apple_on_device_lm.py` covers at least one Foundation and one Local path.

Motivation
DSPy programs that optimize prompts via `BootstrapFewShot` or `MIPROv2` typically make hundreds or thousands of LM calls. Running those optimization loops against cloud LLMs is expensive and slow. Apple Silicon's unified memory makes it possible to run a capable quantized model (e.g. `mlx-community/Llama-3.2-3B-Instruct-4bit`) at sub-100ms latency with zero API cost. Developers can now:

- Run optimizer loops locally with `AppleLocalLM`.
- Use `AppleFoundationLM` in shipping macOS apps that need private, on-device inference.

Changes
New files

- `dspy/clients/apple_fm.py`: `AppleFoundationLM` adapter + shared response types (`_FMResponse`, `_FMUsage`, `_FMChoice`, `_FMMessage`)
- `dspy/clients/apple_local.py`: `AppleLocalLM` adapter (MLX backend, CoreML stub), `_LocalStreamChunk`
- `tests/clients/test_apple_fm.py`: unit tests for `AppleFoundationLM`
- `tests/clients/test_apple_local.py`: unit tests for `AppleLocalLM` (includes streaming)
- `tests/integration/test_apple_fm_integration.py`
- `docs/docs/api/models/AppleFoundationLM.md`
- `docs/docs/api/models/AppleLocalLM.md`
- `examples/apple_on_device_lm.py`

Modified files

- `dspy/clients/__init__.py`: `try/except ImportError` exports for both adapters
- `dspy/__init__.py`: `dspy.AppleFoundationLM`, `dspy.AppleLocalLM` top-level exports
- `docs/docs/learn/programming/language_models.md`
- `pyproject.toml`: `norecursedirs` guard to prevent pytest from entering `experiments/`

Architecture
AppleFoundationLM — Apple Intelligence (macOS 26+)

Native structured outputs. When DSPy passes `response_format=SomePydanticModel` (its standard structured-output path), this adapter intercepts it before it becomes a prompt injection. `_pydantic_to_generable()` maps Pydantic field constraints to Apple's `@generable` constrained decoding:

- `Literal["a", "b"]` → `fm.guide(anyOf=[...])`
- `int = Field(ge=1, le=5)` → `fm.guide(range=(1, 5))`
- `str = Field(pattern=r"\d+")` → `fm.guide(regex=...)`

The model is then called with `session.respond(generating=<generable_cls>)`, which guarantees valid typed output at the token level — not a JSON parse of free text. The result is serialized back to JSON so DSPy's output parser sees the same contract it would from the prompt path.

If `_pydantic_to_generable()` can't map a field (e.g. complex nested type), or if the Swift grammar compiler rejects the schema at runtime, the adapter logs a warning, recreates a fresh session, and retries without `generating=`, falling back gracefully to DSPy's standard prompt-injection path.
Tool calling. Apple's SDK requires tools to be subclasses of `fm.Tool`, not plain callables. `_dspy_tool_to_apple_tool()` dynamically subclasses `fm.Tool` at call time for each DSPy tool, wiring `call(**kwargs)` to the DSPy callable. Generated subclasses are cached by `(tool_name, id(func))` so Apple's per-class SDK registration fires exactly once per unique tool.

Async bridging. Apple's SDK is async-only. `forward()` bridges to sync via `asyncio.run()` with `nest_asyncio` support for Jupyter notebooks.

AppleLocalLM — MLX (Apple Silicon, macOS 14+)

Mixed-LM pipelines. The primary use case is cheap on-device preprocessing before expensive cloud reasoning.
Streaming. `AppleLocalLM` supports `dspy.streamify()` via DSPy's `send_stream` protocol. Wrapping any program with `dspy.streamify()` causes `forward()` to call `mlx_lm.stream_generate()` and push each `_LocalStreamChunk` token to the stream in real time.

Concurrency gate. DSPy optimizers issue many parallel `aforward()` calls. Unconstrained concurrent MLX inference jobs would exhaust Apple Silicon's unified memory pool and OOM. `aforward()` gates all calls through a lazily-initialized `asyncio.Semaphore(max_concurrency)` (default: 1) before offloading to `asyncio.to_thread()`. Users with spare RAM can raise the limit at construction time; the adapter warns if `max_concurrency > 1` since MLX thread-safety on a single model instance is undocumented.

Context window tracking. `context_window` is read from `tokenizer.model_max_length` (with a 4096 fallback). A warning is logged when a prompt would exceed the window rather than silently truncating.
Shared design decisions
Explicit caching. Both adapters bypass LiteLLM, so LiteLLM's automatic caching is unavailable. `dspy.cache.get/put` is wired explicitly in each `forward()`. The cache key covers `{model, messages, temperature, max_tokens}`; DSPy-internal keys (`num_retries`, `stream`, `n`) are excluded to prevent spurious misses. Unknown kwargs are warned and cleared so they cannot silently fragment the cache.
`BaseLM` response contract. `_FMUsage` implements `__iter__` to yield `(key, value)` pairs so `dict(response.usage)` works as expected by `BaseLM._process_completion`. `_FMResponse` carries an explicit `_hidden_params={"response_cost": 0.0}` field — a `None` cost would cause `sum([None, ...])` to raise `TypeError` in DSPy's history aggregator.

Explicit errors over silent degradation. `stream=True` raises `NotImplementedError` (a streaming caller expects an async generator, not a string — use `dspy.streamify()`). `tools=[...]` raises `NotImplementedError` in `AppleLocalLM` (mlx-lm has no native tool API) with a pointer to `AppleFoundationLM` for users who need tools. Unknown backends raise `ValueError`.

Testing
NOTE: Apple-specific tests skip on non-Mac systems, so CI on Linux records skips rather than failures.

Unit tests (Linux/WSL — zero macOS dependencies)

Each test file defines `_make_fake_fm_sdk()` / `_make_fake_mlx_lm()` factories that inject synthetic `types.ModuleType` instances into `sys.modules` via `autouse` fixtures. This lets every logical path — message flattening, Pydantic→generable conversion, tool wrapping, cache hit/miss, concurrency gating, kwarg warn/clear, context overflow warning, ARC session fallback, streaming chunk emission — be exercised without any Apple hardware or SDK.

Integration tests (Mac only)

These import the real `apple_fm_sdk` and skip cleanly on non-macOS platforms. Coverage includes: live round-trip generation, structured output via `@generable`, tool invocation, cache round-trip against the real `dspy.cache`, and `AppleLocalLM` `mlx-lm` generation.
Design Decisions
Non-obvious implementation choices made during development.
Why `fm.Tool` is subclassed dynamically at runtime

Apple's SDK requires tools to be registered as subclasses of `fm.Tool` — you can't pass a callable or wrap a plain function. DSPy tools, on the other hand, are arbitrary Python objects (callables, instances with `.func`, or `dspy.Tool` wrappers). There's no static base class to subclass at module level because `fm.Tool` doesn't exist until `import apple_fm_sdk` runs, which can only happen on macOS 26+.

`_dspy_tool_to_apple_tool()` uses `type()` at call time to dynamically create a fresh subclass of `fm.Tool` for each DSPy tool, wiring `call(**kwargs)` to the DSPy callable. A top-level `class _WrappedTool(fm.Tool): ...` would make the entire module unimportable on Linux. Dynamic subclassing keeps the import guard clean: the class is only created inside `aforward()`, which is only reached after `__init__` has already validated the platform and imported the SDK. Generated subclasses are cached by `(tool_name, id(func))` so any per-class SDK-side registration fires exactly once per unique tool.
How we mocked an entire OS-specific SDK to get unit tests passing on Linux

`apple_fm_sdk` doesn't exist on Linux. `mlx_lm` doesn't exist outside Apple Silicon. Both are imported lazily inside methods. Each test file defines a factory that returns a `types.ModuleType` populated with hand-rolled Python stand-ins, then injects it into `sys.modules` via an `autouse` fixture before any import of the real package can occur.

Key constraints that shaped the fakes:

- `guide()` must return `""`, not `MagicMock`. `_pydantic_to_generable()` passes guide return values as dataclass field defaults, then calls `dataclasses.asdict()`. If a field holds a `MagicMock`, `asdict()` raises `TypeError: Object of type MagicMock is not JSON serializable`.
- `generable()` must be a passthrough decorator. If `fm.generable(cls)` returns a `MagicMock`, `dataclasses.make_dataclass` produces a class the fake `session.respond()` can't instantiate.
- `LanguageModelSession` must be async-context-safe. The fake `respond()` is an `async def` that returns a plain string or a dataclass instance depending on whether `generating=` was passed.
- Platform checks are patched at the `platform` module level. Patching `platform.system` globally in the fixture means even indirect callers see `"Darwin"`.
- `mlx_lm.sample_utils` must be registered as a real submodule in `sys.modules`. A flat `types.ModuleType` with an attribute `sample_utils` is not the same as a registered submodule — Python's import system resolves `from mlx_lm.sample_utils import make_sampler` by looking up `"mlx_lm.sample_utils"` in `sys.modules` directly.
"mlx_lm.sample_utils"insys.modulesdirectly.Why
_FMUsageimplements__iter__BaseLMcallsdict(response.usage)to record token counts. This works for LiteLLM objectsbecause their
Usageclass supports the mapping protocol._FMUsageis a plain dataclass.Adding
__iter__to yield(key, value)pairs makesdict()work without converting_FMUsageto a dict subclass. Cheapest fix that satisfies the contract.Why caching lives in
forward(), notBaseLM.__call__dspy.LMgets automatic caching via LiteLLM's response cache.BaseLMsubclasses thatbypass LiteLLM get no caching —
BaseLM.__call__does not cache. Both Apple adapters wiredspy.cache.get() / put()explicitly inforward(). The cache key covers{model, messages, temperature, max_tokens}with DSPy-internal keys (num_retries,stream,n) excluded to prevent spurious misses.Why
response_formatis intercepted inaforward(), not at the DSPy adapter layerDSPy's
ChatAdapterinjects a JSON schema into the prompt for structured output requests, thenparses the text response back. Apple's SDK offers something better:
session.respond(generating=SomeGenerableClass)triggers native constrained decoding — themodel is guaranteed to emit valid tokens for that schema.
Intercepting
response_formatinaforward()(before it becomes a prompt injection) lets usroute Pydantic models through the native path. The result is serialized back to a JSON string
so DSPy's output parser sees exactly what it would have seen from the prompt-based path — same
contract, better reliability on small on-device models.
Per-call `LanguageModelSession` (stateless pattern)

Apple's `LanguageModelSession` is designed to maintain conversational state across turns. DSPy modules are stateless — each `forward()` call is independent, and DSPy manages its own prompt construction (including injecting few-shot examples). Reusing a session across calls would accumulate spurious conversation history and produce wrong outputs. A new session is created on every `aforward()` call; the overhead is acceptable for on-device inference (no network round-trip).

Why `aforward()` uses a lazy `asyncio.Semaphore`, not a threading lock

DSPy optimizers evaluate candidate prompts in parallel via many concurrent `aforward()` calls. Without a gate, 20 concurrent optimizer candidates submit 20 simultaneous MLX inference jobs, each allocating activation memory in Apple Silicon's unified memory pool — instant OOM.

An `asyncio.Semaphore` in `aforward()` is the natural gate: callers suspend cooperatively before submitting to the thread pool, so only `max_concurrency` blocking jobs run at a time. A `threading.Semaphore` inside the sync path would also work but would block a thread-pool thread while waiting, wasting thread resources.
The semaphore is initialized lazily on first `aforward()` call because `asyncio.Semaphore` must be created in the event loop that will use it — `__init__` often runs outside any event loop.

Why `tools` raises `NotImplementedError` in `AppleLocalLM` rather than being silently dropped

`mlx-lm` has no native tool-calling API. Silently dropping `tools=[...]` would let DSPy programs appear to run successfully while actually skipping all tool invocations, producing wrong outputs with no diagnostic. The error message points directly to `AppleFoundationLM`, which has full native `fm.Tool` support.

Why token counts are computed from the tokenizer, and why `response_cost = 0.0`

Apple's Foundation Model SDK exposes no tokenizer. `mlx-lm` loads a HuggingFace tokenizer as part of `mlx_lm.load()`. Token counts are computed by encoding the flat prompt and the generated text with `tokenizer.encode()` after inference. Accurate counts matter for DSPy's optimization budget tracking — `BaseLM` stores `dict(response.usage)` in history and optimizer callbacks read `prompt_tokens`/`completion_tokens` to estimate cost.

`response_cost = 0.0` rather than `None`: on-device inference has no monetary cost, but DSPy's history aggregator sums `entry["cost"]` across all calls. `sum([None, ...])` raises `TypeError`. Setting `0.0` explicitly makes the sum safe while accurately representing zero cost.

Why unknown kwargs are warned-and-cleared instead of forwarded

Unknown kwargs (e.g. `top_p=0.9`) would change the cache key without changing the model output — every unique `top_p` value creates a new cold cache entry for what is functionally the same generation. Clearing them after warning prevents silent cache fragmentation and surfaces the mismatch to users who set global `dspy.configure` options expecting them to apply.

Streaming strategy for `AppleLocalLM`

Streaming is supported via `dspy.streamify()` using DSPy's `dspy.settings.send_stream` protocol, not via a `stream=True` kwarg. `forward(stream=True)` raises `NotImplementedError` with a message directing users to `streamify()`.

Two code paths:

- Primary path — `forward()` in anyio worker thread (via `asyncify`): `streamify()` wraps `Predict.__call__` with `asyncify`, which runs it in an anyio-managed worker thread. `Predict.forward()` calls `lm.forward()` from that thread. When `dspy.settings.send_stream` is set, `forward()` calls `mlx_lm.stream_generate()` synchronously and pushes each `_LocalStreamChunk` to the anyio `MemoryObjectSendStream` via `anyio.from_thread.run(send_stream.send, chunk)`.
- Secondary path — `aforward()` for direct async callers: when `await lm.aforward()` is called directly (bypassing `Predict`), `_stream_generate_async()` bridges `mlx_lm.stream_generate()` (sync) to an async generator via `asyncio.Queue` + `loop.call_soon_threadsafe()`.

`mlx_lm.stream_generate()` is used rather than the lower-level `generate_step()` because it is the public high-level API that handles EOS detection, max-token limits, and token decoding internally — avoiding fragile reimplementation of per-token control logic.

`_LocalStreamChunk(text, model, predict_id)` is a custom dataclass, not a litellm `ModelResponseStream`. DSPy's `streamify()` passes custom chunk types through its wildcard branch to the caller; `StreamListener` field-extraction is unavailable and all tokens stream raw.
Why `session.respond(generating=...)` is wrapped in `try/except` with session recreation

`_pydantic_to_generable()` can return a valid `@generable` class, but the underlying Swift grammar engine can still reject the schema at inference time (e.g. a `Union[str, List[str]]` field might compile without error yet fail when Apple's constrained-decoding compiler tries to build the grammar automaton). On failure:

- Log a `WARNING` (so integration-test logs reveal schema issues).
- `del session` — the failed session may have advanced internal state.
- Recreate a fresh `LanguageModelSession` from the already-built `session_kwargs`.
- Retry with `await session.respond(prompt=flat_prompt)` (no `generating=`).

`except Exception` rather than `TypeError`/`ValueError` specifically: the Swift bridge surfaces errors as Python `Exception` subclasses whose exact types depend on the SDK version and are undocumented. The intent is unconditional fallback.
Why `max_concurrency > 1` emits a warning instead of being hard-capped at 1

MLX's Python bindings call into a C++/Metal backend. It is undocumented whether a single `mlx.nn.Module` instance supports concurrent `generate()` calls from multiple Python threads. If it does not, `max_concurrency > 1` can cause Metal command queue deadlocks or segfaults with no Python traceback.

Warn rather than cap: hard-capping would deny the benefit to users who test and confirm thread-safety on their specific hardware + MLX version, or who load separate model instances per thread. The default is `max_concurrency=1` (always safe); users who want higher throughput opt in explicitly.
Notes for reviewers

macOS-specific setup (Apple paths only)

Use these steps only when validating `AppleFoundationLM` / `AppleLocalLM` behavior.

1. Use Apple Silicon hardware and macOS 26+.
2. Enable Apple Intelligence in system settings (required for `AppleFoundationLM`).
3. Install DSPy development dependencies.
4. Install `mlx-lm` for `AppleLocalLM` paths.
5. Install `apple_fm_sdk` from Apple's distribution channel. It is not currently published on PyPI, so follow Apple's installation instructions for your SDK access level.
6. Verify imports: `python -c "import dspy; from dspy.clients import AppleFoundationLM, AppleLocalLM; print('ok')"`
7. Verify Apple-path tests.

Expected behavior on non-macOS or without `apple_fm_sdk`: integration tests skip and `import dspy` remains successful.

- `AppleLocalLM(backend="coreml")` raises `NotImplementedError` with an invitation to contribute. The CoreML path is stubbed (not deleted) so the `backend=` parameter is part of the public API from day one.
- `LanguageModelSession` is intentionally created per-call (stateless pattern). DSPy manages all prompt construction including few-shot injection; reusing a session across calls would accumulate spurious conversational history.