Conversation
…ackends

Adds two new BaseLM subclasses for on-device inference on Apple Silicon with zero cloud dependency:

- dspy.AppleFoundationLM — wraps Apple's LanguageModelSession (macOS 26+, Apple Intelligence) with native constrained decoding for structured outputs via @generable and full fm.Tool support for tool calling.
- dspy.AppleLocalLM — wraps mlx-lm to run any HuggingFace model on Apple Silicon (M1/M2/M3/M4). Supports the full DSPy optimizer workflow including BootstrapFewShot and MIPROv2.

Both adapters are zero-regression on Linux CI: all platform-gated imports are lazily loaded inside try/except guards, and __all__ is extended conditionally so `from dspy.clients import *` never raises AttributeError on non-macOS.

Verified on macOS 26.3.1 with Apple Intelligence:

- 64 unit tests pass (mocked SDKs, Linux-compatible)
- 11 integration tests pass against real on-device model
- Native @generable constrained decoding, tool calling, and async confirmed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
mlx_lm.generate() removed the direct temperature= kwarg; temperature is now configured via make_sampler(temp=...) from mlx_lm.sample_utils passed as sampler=. Update _generate() to use the new API and update the unit-test fake module to expose a sample_utils submodule with a stub make_sampler so all 28 tests continue to pass. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a 'cot' demo to apple_on_device_lm.py that loads OpenAI's gpt-oss-20b via InferenceIllusionist/gpt-oss-20b-MLX-4bit and runs dspy.ChainOfThought on the bat-and-ball CRT question. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements DSPy's streamify() protocol for AppleLocalLM: - Add _LocalStreamChunk dataclass (text, model, predict_id fields) — a custom chunk type that passes through streamify()'s wildcard branch - In forward(): detect dspy.settings.send_stream and switch from mlx_lm.generate() to mlx_lm.stream_generate(), forwarding each token to the anyio MemoryObjectSendStream via anyio.from_thread.run(). This handles the primary path: streamify() -> asyncify -> anyio thread -> Predict.forward() -> lm.forward() - In aforward(): detect send_stream and use an asyncio.Queue bridge (_stream_generate_async) for direct async callers - Add 5 unit tests covering chunk delivery, text concatenation, and the forward(stream=True) guard message Note: _LocalStreamChunk is not a litellm ModelResponseStream, so StreamListener field-extraction is unavailable; all tokens stream raw. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds 'streaming' entry to apple_on_device_lm.py showing dspy.streamify() with AppleLocalLM — prints tokens as they arrive then shows the parsed Prediction at the end. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Prevents conftest.py module name collision between experiments/DSPy-AppleFM/tests/ and tests/. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…eanup - All classes, methods, and functions now have docstrings with Google-style Args/Returns/Raises/Yields/Attributes sections - Replace bare `import typing` with explicit `from typing import Literal` (used in _pydantic_to_generable) alongside existing imports - Add return-type annotation to _FMUsage.__iter__ (Iterator[tuple[str, int]]) - Fix _FMMessage.tool_calls annotation: list -> list[Any] - Add _hidden_params dict generic: dict -> dict[str, Any] - Add _WrappedTool.call() docstring - Add inline comments on non-obvious expressions (id(), call_soon_threadsafe, lazy semaphore init, sentinel value, ARC del session) - No logic changes; 69/69 unit tests pass Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace stale "why stream=True raises NotImplementedError" rationale with the actual implemented streaming design: two code paths (primary forward() via anyio.from_thread.run in worker thread, secondary aforward() via asyncio.Queue bridge), stream_generate() vs generate_step() decision, and _LocalStreamChunk chunk type semantics. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move all design-decision reasoning from TODO.md into PR-DRAFT.md under a new "Design Decisions" section. Remove checklist items and verification tables — the implementation is complete. TODO.md reduced to a one-line status pointer. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
I was multitasking; both OS versions should have been 26+...
Pull request overview
This PR introduces native Apple on-device language model backends for DSPy via two new BaseLM adapters: one for Apple’s Foundation Models SDK (Apple Intelligence) and one for local MLX (mlx-lm) inference on Apple Silicon, along with tests, docs, and an example script.
Changes:
- Added `AppleFoundationLM` (Apple Intelligence / `apple_fm_sdk`) and `AppleLocalLM` (MLX / `mlx-lm`) adapters with caching, tool support (Foundation only), and streaming support (Local via `streamify`).
- Added extensive unit tests using mocked SDK/modules and a macOS-only integration test suite for Foundation Models.
- Updated documentation and examples to surface the new models, plus small repo config tweaks (pytest `norecursedirs`, gitignore).
Reviewed changes
Copilot reviewed 13 out of 15 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| `dspy/clients/apple_fm.py` | Adds Apple Foundation Models adapter and shared OpenAI-like response types/helpers. |
| `dspy/clients/apple_local.py` | Adds MLX-backed local adapter with streaming and concurrency gating. |
| `dspy/clients/__init__.py` | Adds guarded exports for Apple adapters; adjusts cache configuration wiring. |
| `tests/clients/test_apple_fm.py` | Unit tests for `AppleFoundationLM` using a mocked `apple_fm_sdk`. |
| `tests/clients/test_apple_local.py` | Unit tests for `AppleLocalLM` using a mocked `mlx_lm` (incl. streaming). |
| `tests/integration/test_apple_fm_integration.py` | Mac-only integration tests against the real Apple Foundation Models SDK. |
| `tests/integration/__init__.py` | Package marker for integration tests. |
| `examples/apple_on_device_lm.py` | Runnable demos for Foundation, MLX local, mixed pipelines, and streaming. |
| `docs/docs/learn/programming/language_models.md` | Adds Apple Foundation + Apple Silicon (MLX) usage tabs. |
| `docs/docs/api/models/AppleFoundationLM.md` | API reference stub for `dspy.AppleFoundationLM`. |
| `docs/docs/api/models/AppleLocalLM.md` | API reference stub for `dspy.AppleLocalLM`. |
| `docs/mkdocs.yml` | Adds Apple model docs pages to navigation. |
| `pyproject.toml` | Adds pytest `norecursedirs` to avoid recursing into certain directories. |
| `README.md` | Adds macOS extras section for Apple on-device models. |
| `.gitignore` | Ignores `venv/` and `.venv/`. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Raymond Rizzo <raymond.rizzo@gmail.com>
…est guard

- Replace string concatenation with list+join in both streaming paths of AppleLocalLM.forward() and aforward() — avoids O(n²) for long outputs
- Use asyncio.get_running_loop() instead of deprecated get_event_loop() in _stream_generate_async()
- Silently discard temperature/max_tokens in AppleFoundationLM.aforward() before the unknown-kwargs warning — they're valid DSPy params included in the cache key but unsupported by Apple's SDK
- Wrap AppleFoundationLM init in integration test fixture with try/except RuntimeError so tests skip cleanly on Macs without Apple Intelligence

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
zombat left a comment
Updates as per Copilot suggestions.
When response_format is a Pydantic model, AppleLocalLM now builds an outlines FSM logits processor and passes it to mlx_lm.generate() / stream_generate(), guaranteeing the model output matches the schema. Processor is cached per schema to amortise FSM compilation cost. Falls back to prompt-only mode with a warning when outlines is not installed (pip install 'outlines[mlxlm]'). Cache key includes a response_schema field when a schema is active to prevent cross-contamination with unconstrained calls.
…Error Apple's on-device content filter raises GuardrailViolationError when it rejects a prompt. Previously this either bubbled up as an opaque SDK exception or got swallowed by the generable-fallback retry logic (which would hit the same guardrail again anyway). Now both aforward() call sites catch it by class name and immediately re-raise a RuntimeError with a clear message directing the user to rephrase their input. Three tests cover the plain-text path, the @generable path, and passthrough of unrelated exceptions.
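The catch-by-class-name pattern described above can be sketched as follows. This is a hypothetical stand-in: `GuardrailViolationError` here is a local class standing in for Apple's SDK exception, and `respond`/`safe_respond` are illustrative helpers, not the actual adapter methods.

```python
# Sketch: re-raise an SDK exception matched by class *name* rather than by type,
# useful when the real exception class only exists on macOS and can't be imported.
class GuardrailViolationError(Exception):  # stand-in for Apple's SDK exception
    pass

def respond(prompt):
    # Pretend the on-device content filter rejected the prompt.
    raise GuardrailViolationError("content filter rejected prompt")

def safe_respond(prompt):
    try:
        return respond(prompt)
    except Exception as e:
        # Match by name: the real class can't be imported on non-macOS.
        if type(e).__name__ == "GuardrailViolationError":
            raise RuntimeError(
                "Apple Intelligence guardrail rejected this input; "
                "please rephrase your prompt."
            ) from e
        raise  # unrelated exceptions pass through unchanged
```

Matching on `type(e).__name__` keeps the handler importable on Linux while still giving macOS users a clear, actionable error.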
Moved into apple_base.py:

- The 4 response dataclasses, verbatim (preserving all docs/comments/style)
- _flatten_messages() verbatim
- _AppleBaseLM(BaseLM) abstract base with a single concrete method: a unified _build_response(text, usage=None) that both adapters can inherit

apple_fm.py edits:

- Remove the 4 dataclasses + _flatten_messages (~120 lines); import from apple_base instead
- AppleFoundationLM(BaseLM) → AppleFoundationLM(_AppleBaseLM)
- Drop its _build_response override (now inherited)
- Result: 530 lines

apple_local.py edits:

- Swap `from dspy.clients.apple_fm import ...` → `from dspy.clients.apple_base import ...`
- Remove inline `from dspy.clients.apple_fm import _flatten_messages` inside _apply_chat_template
- AppleLocalLM(BaseLM) → AppleLocalLM(_AppleBaseLM)
- Drop its _build_response override (now inherited)
- Switch _LocalStreamChunk from @dataclass shorthand to @dataclasses.dataclass to match file conventions
- Result: 710 lines
…ter + MLX mixin

Introduce apple_base.py as the single home for the OpenAI-compatible response types (_FMMessage, _FMChoice, _FMUsage, _FMResponse), _flatten_messages, _run_async, and _AppleBaseLM — eliminating the awkward private cross-import from apple_local → apple_fm. Extract all MLX inference internals (_MLXMixin, _LocalStreamChunk, _apply_chat_template, _response_format_to_schema) into apple_local_mlx.py, reducing apple_local.py to a thin public adapter.

Deduplicate _build_response (now on _AppleBaseLM), _raise_for_guardrail (static method on _AppleBaseLM, replaces two identical inline blocks in aforward), and token-counting/_FMUsage construction (_compute_usage on AppleLocalLM, used by both forward and aforward).

Line counts before → after:

- apple_fm.py 652 → 467
- apple_local.py 742 → 473
- apple_base.py — → 241 (new)
- apple_local_mlx.py — → 300 (new)

Also: remove unused imports flagged by ruff, convert dict() calls to literals (C408), re-export _FM* types from apple_fm for import compatibility. No behaviour changes; 80/80 tests pass.
Great catches. It honestly started to get away from me a bit as it came together, but it's been broken up now. 1 and 2 are done; 3 is ongoing...
…call fields
litellm 1.82.4 (Pydantic V2) raises AttributeError when ModelResponse.usage
is not set; fall back to {} instead of crashing. Also strip None-valued fields
from Responses API tool call dicts — 1.82.4 adds namespace=None to
ResponseFunctionToolCall, breaking dict equality assertions.
#3 complete
feat: Add `AppleFoundationLM` and `AppleLocalLM` — native Apple Silicon & Apple Intelligence backends

Summary

This PR adds two new `BaseLM` subclasses that allow DSPy programs to run entirely on-device on Apple Silicon, with no cloud dependency:

- `dspy.AppleFoundationLM` — wraps Apple's `LanguageModelSession` (macOS 26+, Apple Intelligence) with native constrained decoding for structured outputs and full `fm.Tool` support for tool calling.
- `dspy.AppleLocalLM` — wraps `mlx-lm` to run any HuggingFace model natively on Apple Silicon. Supports the full DSPy optimizer workflow including `BootstrapFewShot` and `MIPROv2`.

Both adapters are zero-regression on Linux CI: all platform-gated imports are lazily loaded and wrapped in `try/except` guards. The entire test suite (69 unit tests) passes on Linux/WSL with no macOS dependencies.
Reviewer Note: Feature-Pass Evidence vs Pre-Existing Failures

Feature-pass evidence (this PR path):

- `pre-commit run --all-files` passes on this branch.
- `tests/clients/test_apple_fm.py`
- `tests/clients/test_apple_local.py`
- `tests/integration/test_apple_fm_integration.py`

Pre-existing failures (not introduced by this PR):

- The full run (`pytest -vv tests/`) reports failures outside Apple-path files.
- Those failures (`AttributeError: 'ModelResponse' object has no attribute 'usage'`) appear in non-Apple tests.

Apple-Only Reviewer Checklist

- `import dspy` succeeds when Apple SDKs are absent.
- `dspy.AppleFoundationLM` init guards: clear errors for non-macOS / missing SDK / unavailable model.
- `dspy.AppleLocalLM` init guards: clear errors for non-macOS, Intel Mac, missing `mlx-lm`.
- `AppleFoundationLM`: `response_format` maps to native constrained decoding when supported.
- `AppleFoundationLM` executes DSPy tools via `fm.Tool` wrapping.
- `AppleLocalLM` rejects `tools=[...]` with explicit `NotImplementedError`.
- `AppleLocalLM` supports `dspy.streamify()` token streaming.
- `stream=True` direct call behavior remains explicit and documented.
- `AppleLocalLM` (`max_concurrency`) prevents uncontrolled parallel MLX calls.
- `examples/apple_on_device_lm.py` covers at least one Foundation and one Local path.

Motivation
DSPy programs that optimize prompts via `BootstrapFewShot` or `MIPROv2` typically make hundreds or thousands of LM calls. Running those optimization loops against cloud LLMs is expensive and slow. Apple Silicon's unified memory makes it possible to run a capable quantized model (e.g. `mlx-community/Llama-3.2-3B-Instruct-4bit`) at sub-100ms latency with zero API cost. Developers can now:

- Run optimizer loops locally with `AppleLocalLM`.
- Use `AppleFoundationLM` in shipping macOS apps that need private, on-device inference.

Changes
New files

- `dspy/clients/apple_fm.py`: `AppleFoundationLM` adapter + shared response types (`_FMResponse`, `_FMUsage`, `_FMChoice`, `_FMMessage`)
- `dspy/clients/apple_local.py`: `AppleLocalLM` adapter (MLX backend, CoreML stub), `_LocalStreamChunk`
- `tests/clients/test_apple_fm.py`: unit tests for `AppleFoundationLM`
- `tests/clients/test_apple_local.py`: unit tests for `AppleLocalLM` (includes streaming)
- `tests/integration/test_apple_fm_integration.py`
- `docs/docs/api/models/AppleFoundationLM.md`
- `docs/docs/api/models/AppleLocalLM.md`
- `examples/apple_on_device_lm.py`

Modified files

- `dspy/clients/__init__.py`: `try/except ImportError` exports for both adapters
- `dspy/__init__.py`: `dspy.AppleFoundationLM`, `dspy.AppleLocalLM` top-level exports
- `docs/docs/learn/programming/language_models.md`
- `pyproject.toml`: `norecursedirs` guard to prevent pytest from entering `experiments/`

Architecture
AppleFoundationLM — Apple Intelligence (macOS 26+)

Native structured outputs. When DSPy passes `response_format=SomePydanticModel` (its standard structured-output path), this adapter intercepts it before it becomes a prompt injection. `_pydantic_to_generable()` maps Pydantic field constraints to Apple's `@generable` constrained decoding:

- `Literal["a", "b"]` → `fm.guide(anyOf=[...])`
- `int = Field(ge=1, le=5)` → `fm.guide(range=(1, 5))`
- `str = Field(pattern=r"\d+")` → `fm.guide(regex=...)`

The model is then called with `session.respond(generating=<generable_cls>)`, which guarantees valid typed output at the token level — not a JSON parse of free text. The result is serialized back to JSON so DSPy's output parser sees the same contract it would from the prompt path.

If `_pydantic_to_generable()` can't map a field (e.g. complex nested type), or if the Swift grammar compiler rejects the schema at runtime, the adapter logs a warning, recreates a fresh session, and retries without `generating=`, falling back gracefully to DSPy's standard prompt-injection path.
Tool calling. Apple's SDK requires tools to be subclasses of `fm.Tool`, not plain callables. `_dspy_tool_to_apple_tool()` dynamically subclasses `fm.Tool` at call time for each DSPy tool, wiring `call(**kwargs)` to the DSPy callable. Generated subclasses are cached by `(tool_name, id(func))` so Apple's per-class SDK registration fires exactly once per unique tool.

Async bridging. Apple's SDK is async-only. `forward()` bridges to sync via `asyncio.run()` with `nest_asyncio` support for Jupyter notebooks.

AppleLocalLM — MLX (Apple Silicon, macOS 14+)

Mixed-LM pipelines. The primary use case is cheap on-device preprocessing before expensive cloud reasoning.
Streaming. `AppleLocalLM` supports `dspy.streamify()` via DSPy's `send_stream` protocol. Wrapping any program with `dspy.streamify()` causes `forward()` to call `mlx_lm.stream_generate()` and push each `_LocalStreamChunk` token to the stream in real time.

Concurrency gate. DSPy optimizers issue many parallel `aforward()` calls. Unconstrained concurrent MLX inference jobs would exhaust Apple Silicon's unified memory pool and OOM. `aforward()` gates all calls through a lazily-initialized `asyncio.Semaphore(max_concurrency)` (default: 1) before offloading to `asyncio.to_thread()`. Users with spare RAM can raise the limit at construction time; the adapter warns if `max_concurrency > 1` since MLX thread-safety on a single model instance is undocumented.

Context window tracking. `context_window` is read from `tokenizer.model_max_length` (with a 4096 fallback). A warning is logged when a prompt would exceed the window rather than silently truncating.
Shared design decisions
Explicit caching. Both adapters bypass LiteLLM, so LiteLLM's automatic caching is unavailable. `dspy.cache.get/put` is wired explicitly in each `forward()`. The cache key covers `{model, messages, temperature, max_tokens}`; DSPy-internal keys (`num_retries`, `stream`, `n`) are excluded to prevent spurious misses. Unknown kwargs are warned and cleared so they cannot silently fragment the cache.
`BaseLM` response contract. `_FMUsage` implements `__iter__` to yield `(key, value)` pairs so `dict(response.usage)` works as expected by `BaseLM._process_completion`. `_FMResponse` carries an explicit `_hidden_params={"response_cost": 0.0}` field — a `None` cost would cause `sum([None, ...])` to raise `TypeError` in DSPy's history aggregator.

Explicit errors over silent degradation. `stream=True` raises `NotImplementedError` (a streaming caller expects an async generator, not a string — use `dspy.streamify()`). `tools=[...]` raises `NotImplementedError` in `AppleLocalLM` (mlx-lm has no native tool API) with a pointer to `AppleFoundationLM` for users who need tools. Unknown backends raise `ValueError`.

Testing
NOTE: Apple-specific tests skip on non-Mac systems, so CI on Linux records skips rather than failures.

Unit tests (Linux/WSL — zero macOS dependencies)

Each test file defines `_make_fake_fm_sdk()` / `_make_fake_mlx_lm()` factories that inject synthetic `types.ModuleType` instances into `sys.modules` via `autouse` fixtures. This lets every logical path — message flattening, Pydantic→generable conversion, tool wrapping, cache hit/miss, concurrency gating, kwarg warn/clear, context overflow warning, ARC session fallback, streaming chunk emission — be exercised without any Apple hardware or SDK.

Integration tests (Mac only)

These import the real `apple_fm_sdk` and skip cleanly on non-macOS platforms. Coverage includes: live round-trip generation, structured output via `@generable`, tool invocation, cache round-trip against the real `dspy.cache`, and `AppleLocalLM` `mlx-lm` generation.
Design Decisions
Non-obvious implementation choices made during development.
Why `fm.Tool` is subclassed dynamically at runtime

Apple's SDK requires tools to be registered as subclasses of `fm.Tool` — you can't pass a callable or wrap a plain function. DSPy tools, on the other hand, are arbitrary Python objects (callables, instances with `.func`, or `dspy.Tool` wrappers). There's no static base class to subclass at module level because `fm.Tool` doesn't exist until `import apple_fm_sdk` runs, which can only happen on macOS 26+.

`_dspy_tool_to_apple_tool()` uses `type()` at call time to dynamically create a fresh subclass of `fm.Tool` for each DSPy tool, wiring `call(**kwargs)` to the DSPy callable. A top-level `class _WrappedTool(fm.Tool): ...` would make the entire module unimportable on Linux. Dynamic subclassing keeps the import guard clean: the class is only created inside `aforward()`, which is only reached after `__init__` has already validated the platform and imported the SDK. Generated subclasses are cached by `(tool_name, id(func))` so any per-class SDK-side registration fires exactly once per unique tool.
How we mocked an entire OS-specific SDK to get unit tests passing on Linux

`apple_fm_sdk` doesn't exist on Linux. `mlx_lm` doesn't exist outside Apple Silicon. Both are imported lazily inside methods. Each test file defines a factory that returns a `types.ModuleType` populated with hand-rolled Python stand-ins, then injects it into `sys.modules` via an `autouse` fixture before any import of the real package can occur.

Key constraints that shaped the fakes:

- `guide()` must return `""`, not `MagicMock`. `_pydantic_to_generable()` passes guide return values as dataclass field defaults, then calls `dataclasses.asdict()`. If a field holds a `MagicMock`, `asdict()` raises `TypeError: Object of type MagicMock is not JSON serializable`.
- `generable()` must be a passthrough decorator. If `fm.generable(cls)` returns a `MagicMock`, `dataclasses.make_dataclass` produces a class the fake `session.respond()` can't instantiate.
- `LanguageModelSession` must be async-context-safe. The fake `respond()` is an `async def` that returns a plain string or a dataclass instance depending on whether `generating=` was passed.
- Platform checks are patched at the `platform` module level. Patching `platform.system` globally in the fixture means even indirect callers see `"Darwin"`.
- `mlx_lm.sample_utils` must be registered as a real submodule in `sys.modules`. A flat `types.ModuleType` with an attribute `sample_utils` is not the same as a registered submodule — Python's import system resolves `from mlx_lm.sample_utils import make_sampler` by looking up `"mlx_lm.sample_utils"` in `sys.modules` directly.
"mlx_lm.sample_utils"insys.modulesdirectly.Why
_FMUsageimplements__iter__BaseLMcallsdict(response.usage)to record token counts. This works for LiteLLM objectsbecause their
Usageclass supports the mapping protocol._FMUsageis a plain dataclass.Adding
__iter__to yield(key, value)pairs makesdict()work without converting_FMUsageto a dict subclass. Cheapest fix that satisfies the contract.Why caching lives in
forward(), notBaseLM.__call__dspy.LMgets automatic caching via LiteLLM's response cache.BaseLMsubclasses thatbypass LiteLLM get no caching —
BaseLM.__call__does not cache. Both Apple adapters wiredspy.cache.get() / put()explicitly inforward(). The cache key covers{model, messages, temperature, max_tokens}with DSPy-internal keys (num_retries,stream,n) excluded to prevent spurious misses.Why
response_formatis intercepted inaforward(), not at the DSPy adapter layerDSPy's
ChatAdapterinjects a JSON schema into the prompt for structured output requests, thenparses the text response back. Apple's SDK offers something better:
session.respond(generating=SomeGenerableClass)triggers native constrained decoding — themodel is guaranteed to emit valid tokens for that schema.
Intercepting
response_formatinaforward()(before it becomes a prompt injection) lets usroute Pydantic models through the native path. The result is serialized back to a JSON string
so DSPy's output parser sees exactly what it would have seen from the prompt-based path — same
contract, better reliability on small on-device models.
Per-call `LanguageModelSession` (stateless pattern)

Apple's `LanguageModelSession` is designed to maintain conversational state across turns. DSPy modules are stateless — each `forward()` call is independent, and DSPy manages its own prompt construction (including injecting few-shot examples). Reusing a session across calls would accumulate spurious conversation history and produce wrong outputs. A new session is created on every `aforward()` call; the overhead is acceptable for on-device inference (no network round-trip).

Why `aforward()` uses a lazy `asyncio.Semaphore`, not a threading lock

DSPy optimizers evaluate candidate prompts in parallel via many concurrent `aforward()` calls. Without a gate, 20 concurrent optimizer candidates submit 20 simultaneous MLX inference jobs, each allocating activation memory in Apple Silicon's unified memory pool — instant OOM.

An `asyncio.Semaphore` in `aforward()` is the natural gate: callers suspend cooperatively before submitting to the thread pool, so only `max_concurrency` blocking jobs run at a time. A `threading.Semaphore` inside the sync path would also work but would block a thread-pool thread while waiting, wasting thread resources.
The semaphore is initialized lazily on first `aforward()` call because `asyncio.Semaphore` must be created in the event loop that will use it — `__init__` often runs outside any event loop.

Why `tools` raises `NotImplementedError` in `AppleLocalLM` rather than being silently dropped

`mlx-lm` has no native tool-calling API. Silently dropping `tools=[...]` would let DSPy programs appear to run successfully while actually skipping all tool invocations, producing wrong outputs with no diagnostic. The error message points directly to `AppleFoundationLM`, which has full native `fm.Tool` support.

Why token counts are computed from the tokenizer, and why `response_cost = 0.0`

Apple's Foundation Model SDK exposes no tokenizer. `mlx-lm` loads a HuggingFace tokenizer as part of `mlx_lm.load()`. Token counts are computed by encoding the flat prompt and the generated text with `tokenizer.encode()` after inference. Accurate counts matter for DSPy's optimization budget tracking — `BaseLM` stores `dict(response.usage)` in history and optimizer callbacks read `prompt_tokens`/`completion_tokens` to estimate cost.

`response_cost = 0.0` rather than `None`: on-device inference has no monetary cost, but DSPy's history aggregator sums `entry["cost"]` across all calls. `sum([None, ...])` raises `TypeError`. Setting `0.0` explicitly makes the sum safe while accurately representing zero cost.

Why unknown kwargs are warned-and-cleared instead of forwarded

Unknown kwargs (e.g. `top_p=0.9`) would change the cache key without changing the model output — every unique `top_p` value creates a new cold cache entry for what is functionally the same generation. Clearing them after warning prevents silent cache fragmentation and surfaces the mismatch to users who set global `dspy.configure` options expecting them to apply.

Streaming strategy for `AppleLocalLM`

Streaming is supported via `dspy.streamify()` using DSPy's `dspy.settings.send_stream` protocol, not via a `stream=True` kwarg. `forward(stream=True)` raises `NotImplementedError` with a message directing users to `streamify()`.

Two code paths:

- Primary path — `forward()` in anyio worker thread (via `asyncify`): `streamify()` wraps `Predict.__call__` with `asyncify`, which runs it in an anyio-managed worker thread. `Predict.forward()` calls `lm.forward()` from that thread. When `dspy.settings.send_stream` is set, `forward()` calls `mlx_lm.stream_generate()` synchronously and pushes each `_LocalStreamChunk` to the anyio `MemoryObjectSendStream` via `anyio.from_thread.run(send_stream.send, chunk)`.
- Secondary path — `aforward()` for direct async callers: when `await lm.aforward()` is called directly (bypassing `Predict`), `_stream_generate_async()` bridges `mlx_lm.stream_generate()` (sync) to an async generator via `asyncio.Queue` + `loop.call_soon_threadsafe()`.

`mlx_lm.stream_generate()` is used rather than the lower-level `generate_step()` because it is the public high-level API that handles EOS detection, max-token limits, and token decoding internally — avoiding fragile reimplementation of per-token control logic.

`_LocalStreamChunk(text, model, predict_id)` is a custom dataclass, not a litellm `ModelResponseStream`. DSPy's `streamify()` passes custom chunk types through its wildcard branch to the caller; `StreamListener` field-extraction is unavailable and all tokens stream raw.
Why `session.respond(generating=...)` is wrapped in `try/except` with session recreation

`_pydantic_to_generable()` can return a valid `@generable` class, but the underlying Swift grammar engine can still reject the schema at inference time (e.g. a `Union[str, List[str]]` field might compile without error yet fail when Apple's constrained-decoding compiler tries to build the grammar automaton). On failure:

- Log a `WARNING` (so integration-test logs reveal schema issues).
- `del session` — the failed session may have advanced internal state.
- Recreate a fresh `LanguageModelSession` from the already-built `session_kwargs`.
- Retry with `await session.respond(prompt=flat_prompt)` (no `generating=`).

`except Exception` rather than `TypeError`/`ValueError` specifically: the Swift bridge surfaces errors as Python `Exception` subclasses whose exact types depend on the SDK version and are undocumented. The intent is unconditional fallback.
Why `max_concurrency > 1` emits a warning instead of being hard-capped at 1

MLX's Python bindings call into a C++/Metal backend. It is undocumented whether a single `mlx.nn.Module` instance supports concurrent `generate()` calls from multiple Python threads. If it does not, `max_concurrency > 1` can cause Metal command queue deadlocks or segfaults with no Python traceback.

Warn rather than cap: hard-capping would deny the benefit to users who test and confirm thread-safety on their specific hardware + MLX version, or who load separate model instances per thread. The default is `max_concurrency=1` (always safe); users who want higher throughput opt in explicitly.
Notes for reviewers

macOS-specific setup (Apple paths only)

Use these steps only when validating `AppleFoundationLM` / `AppleLocalLM` behavior.

1. Use Apple Silicon hardware and macOS 26+.
2. Enable Apple Intelligence in system settings (required for `AppleFoundationLM`).
3. Install DSPy development dependencies.
4. Install `mlx-lm` for `AppleLocalLM` paths.
5. Install `apple_fm_sdk` from Apple's distribution channel. It is not currently published on PyPI, so follow Apple's installation instructions for your SDK access level.
6. Verify imports: `python -c "import dspy; from dspy.clients import AppleFoundationLM, AppleLocalLM; print('ok')"`
7. Verify Apple-path tests.

Expected behavior on non-macOS or without `apple_fm_sdk`: integration tests skip and `import dspy` remains successful.

- `AppleLocalLM(backend="coreml")` raises `NotImplementedError` with an invitation to contribute. The CoreML path is stubbed (not deleted) so the `backend=` parameter is part of the public API from day one.
- `LanguageModelSession` is intentionally created per-call (stateless pattern). DSPy manages all prompt construction including few-shot injection; reusing a session across calls would accumulate spurious conversational history.