WWDC26: Coding Intelligence, Machine Learning & AI - Q&A
Direct answers from Apple Engineers during WWDC26
Apple’s AI stack is no longer one framework or one model choice. This lab covers how Foundation Models, Core AI, Core ML, MLX, Xcode agents, evaluations, Vision, Private Cloud Compute, local models, context management, guardrails, and model deployment fit together.
As usual, the goal is simple: make the questions easier to scan, easier to revisit, and easier to connect with real app development problems.
I tried to preserve the original wording and combine related answers where appropriate. However, some inaccuracies or mismatches are still possible.
Enjoy! And subscribe so you don’t miss the next Lab.
Could you explain the roles of Core AI, Core ML, and MLX in simple terms from a beginner’s perspective? How should one understand them and decide which one to learn or use?
Think of Apple’s machine learning stack as layers, depending on how much control you need.
If you are building an LLM-based feature, start with the Foundation Models framework. Try the system language model first, then use evaluations to check whether it works for your use case. If you need more capability or a larger context window, use Private Cloud Compute. If you need your own model, you can still plug it into Foundation Models through the language model protocol.
If your work is not language-model based (for example, diffusion, image segmentation, or another custom neural network), use Core AI. Core AI is the forward-looking path for neural-network workloads and for bringing custom models into apps.
Core ML remains available, especially for more traditional ML use cases such as decision trees and similar model types.
MLX is the lower-level and flexible option, especially useful for training, experimentation, distributed inference, local AI workflows, and cases where you want more direct control over model execution.
What is the on-device Foundation Models context window in iOS 27? Is input and output counted against one shared token budget?
The on-device system language model context window is 4,096 tokens. It is a shared budget, so input and output both count against the same window.
For example, if you send roughly 4,000 tokens as input, the model only has about 96 tokens left for output.
Private Cloud Compute supports a larger 32K context window, also as a shared budget. If you need even larger context windows, you can bring another provider or model through the language model protocol, including Core AI, MLX, or server-backed model packages.
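To make the shared-budget arithmetic concrete, here is a minimal sketch. The helper and constant names are hypothetical (they are not framework APIs), and it assumes "32K" means 32 × 1,024 tokens.

```swift
// Hypothetical constants taken from the answer above, not from a framework API.
let onDeviceContextWindow = 4_096
let privateCloudContextWindow = 32 * 1_024 // assumption: "32K" means 32,768

/// Tokens left for the response once the prompt has used its share
/// of a shared input/output window. Never returns a negative value.
func remainingOutputTokens(contextWindow: Int, inputTokens: Int) -> Int {
    max(0, contextWindow - inputTokens)
}

print(remainingOutputTokens(contextWindow: onDeviceContextWindow, inputTokens: 4_000))
// prints 96
```

In a real app the input count would come from the framework's token-counting API rather than a hard-coded number.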
Can Foundation Models run inside a background app refresh task or background processing task, especially while the phone is locked, asleep, or the app has been backgrounded for a while?
Yes, Foundation Models can run in the background, including inside background tasks. However, the system may rate-limit the app, especially if the OS is busy.
If that happens, the system language model can throw a rate-limited error. Your app should catch that error and try again later.
On macOS, the local Foundation Model should not be rate-limited as long as the app is in the foreground. Private Cloud Compute can also be rate-limited, but for different reasons, such as sending too many requests in a short period.
The quality of the output is not reduced by background execution. The request either runs or it does not; it may just take longer or require retrying if rate-limited.
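The "catch and try again later" advice can be sketched roughly as follows. `GenerationError` and `withRateLimitRetry` are hypothetical names for illustration; the real framework's error type and case names may differ. The delay function is injected so the retry policy can be exercised without actually sleeping.

```swift
import Foundation

// Hypothetical stand-in for the error the framework throws when rate-limited.
enum GenerationError: Error {
    case rateLimited
    case other(String)
}

/// Runs `operation`, retrying only on `.rateLimited` and waiting longer each time.
/// Any other error, or a rate-limit error on the final attempt, is rethrown.
func withRateLimitRetry<T>(
    maxAttempts: Int = 3,
    baseDelay: TimeInterval = 2,
    wait: (TimeInterval) -> Void = { Thread.sleep(forTimeInterval: $0) },
    operation: () throws -> T
) throws -> T {
    var attempt = 1
    while true {
        do {
            return try operation()
        } catch GenerationError.rateLimited where attempt < maxAttempts {
            // Exponential backoff: baseDelay, 2x, 4x, ...
            wait(baseDelay * pow(2, Double(attempt - 1)))
            attempt += 1
        }
    }
}

// Usage: a fake request that is rate-limited twice, then succeeds.
var calls = 0
var waits: [TimeInterval] = []
let text = try withRateLimitRetry(wait: { waits.append($0) }) { () -> String in
    calls += 1
    if calls < 3 { throw GenerationError.rateLimited }
    return "summary"
}
print(text, calls, waits) // prints: summary 3 [2.0, 4.0]
```

Inside a background task with a limited time budget, "later" may be better implemented by scheduling another task than by sleeping in place.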
For macOS 27 Apple Intelligence, what does the waitlist mean? Both Siri local and PCC models are working. Are we getting different models on the waitlist? Does this beta include AFM Core advanced 20B?
The waitlist applies only to Siri. It does not apply to the Private Cloud Compute language model or to the on-device pieces used by the Foundation Models framework.
The panel also confirmed that the beta includes AFM Core advanced, used for voice features and related capabilities.
For deeper Siri-specific details, the recommendation was to bring the question to the Apple Intelligence group lab.
The Foundation Models framework now supports bringing your own LLM provider alongside the on-device model and Private Cloud Compute. Can you mix all three within a single agentic flow? What are the data privacy and attribution boundaries once a third-party provider is in the loop?
Yes, you can mix them in a single agentic flow. The key API is dynamic profiles, which let you route different parts of a workflow to different models declaratively.
The panel described two patterns: baton pass and phone a friend.
In a baton-pass flow, each model receives the full context from earlier steps. That works well when all models have the same privacy expectations, such as staying on-device or within Private Cloud Compute, or when you are comfortable sharing the full context.
In a phone-a-friend flow, the main model calls another model with only the specific question it needs help with. The other model does not see the full transcript. It returns an answer, and control goes back to the original model. This is better when you want a stronger privacy boundary or when a third-party model should not receive all prior context.
Dynamic profiles and profile modifiers can also help manage different context windows. For example, you can keep only the last few transcript entries, drop old tool calls after they are used, or shape the transcript differently depending on which model is selected.
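The difference between the two patterns comes down to what the second model is allowed to see. The sketch below uses hypothetical types (`Entry`, `batonPassContext`, `phoneAFriendContext`) and does not use the dynamic profiles API; it only illustrates the privacy boundary.

```swift
// Hypothetical transcript entry for illustration only.
struct Entry {
    enum Role { case user, assistant }
    let role: Role
    let text: String
}

/// Baton pass: the next model inherits every earlier entry.
func batonPassContext(from transcript: [Entry]) -> [Entry] {
    transcript
}

/// Phone a friend: the helper model sees only the one question being asked.
func phoneAFriendContext(question: String) -> [Entry] {
    [Entry(role: .user, text: question)]
}

let transcript = [
    Entry(role: .user, text: "Here is my private journal entry ..."),
    Entry(role: .assistant, text: "Thanks, I have read it."),
    Entry(role: .user, text: "Which city is the capital of Australia?")
]

print(batonPassContext(from: transcript).count)                      // prints 3
print(phoneAFriendContext(question: transcript[2].text).count)       // prints 1
```

With a third-party provider in the loop, the second shape keeps the journal entry out of the request entirely.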
My app uses on-device speech-to-text and must recognize names and proper nouns that general models miss. Does iOS automatically personalize recognition to each user, learning their words and pronunciations over time, or do I need to build and maintain that list myself?
No one from the speech framework team was on the panel, so there was no definitive API answer.
At a high level, speech personalization often involves a component that adapts to user-specific words, names, contacts, or pronunciations. Whether the system API already exposes the personalization you need depends on the speech recognition API and language you are using.
The recommendation was to ask this on the developer forums so the speech framework engineers can answer directly.
If you need custom behavior beyond the system speech APIs, it may be possible to build your own model or personalization layer, but the panel suggested starting with native speech recognition support first, especially for supported languages.
How do you train coding agents to know more about my code style or a specific area? I use a local LLM with Xcode or VS Code, and my complex codebase includes visionOS, Metal, physics simulation, and macros that generate 3D resources, but it does not perform very well.
Agents learn best when they can search, inspect, and write down what they discover.
First, give the agent access to examples from your project. It will often infer style from nearby source code. For stronger guidance, use files such as AGENTS.md or similar project-level instructions. Keep the always-included file short, because it consumes context on every request.
Reference other markdown files from there. For example, tell the agent where the networking guide, style guide, Metal conventions, or macro documentation live. Then the agent can search those files only when needed.
You can also ask the agent to document what it learns. If it discovers how your networking layer works or how a crash was fixed, have it write that down so future sessions can reuse the knowledge.
For Xcode 27, the panel recommended trying ACP support. ACP lets Xcode talk to the agent of your choice, including agents connected to local models through tools like LM Studio or Ollama. This is more capable than a simple chat-completion workflow because an agent can manage state, use tools, inspect files, and work in longer loops.
The model still matters. Smaller local models can often copy style and follow local conventions, but they may struggle with deep reasoning across a large and complex codebase. Xcode’s documentation search tools can help ground the model in new Apple APIs during beta periods.
With regard to UI testing, what practical steps can teams take to integrate automated approaches into their testing workflows on Apple platforms?
Think of testing in layers.
Start with small, fast unit tests. Agents are good at helping enumerate cases and generate many focused tests for small pieces of logic.
Then add a smaller number of integration tests that bring in dependencies and test broader behavior.
UI tests should be the final layer, not the only layer. They are more expensive and slower, so use them to verify the important connection points in the UI rather than every possible permutation.
Xcode 27 adds simulator interaction for agents. An agent can tap, swipe, type, inspect screenshots, and read the accessibility tree. That means it can explore an app, find issues, and then help generate repeatable UI tests so you do not need the agent to manually run the same exploration every time.
Have there been any updates to Natural Language processing and Apple Vision? Now that Foundation Models support image attachments, what should be the preferred method of image extraction?
If there is a specialized API for the task, use that first.
For well-understood image tasks such as barcode reading, OCR, segmentation, or detecting a known kind of object, Vision or image-understanding APIs are usually the right choice. They are efficient, specialized, and easier to test.
Foundation Models are better when the task requires semantic understanding, natural-language nuance, or open-ended interpretation. For example, if the user’s prompt changes the kind of image reasoning needed, an LLM or multimodal model may fit better.
The panel compared foundation models to a 3D printer: flexible and great for custom jobs. Specialized APIs are more like a production line: faster and better when the task is known and repeatable.
The same rule applies to translation. If you just need ordinary translation, use the Translation framework. If you need style transfer or unusual natural-language interpretation, then a language model may be appropriate.
On-device LLMs have relatively limited token capacity. What are the best practices for managing prompt size, tool definitions, and context to avoid exceeding limits while still maintaining high-quality responses?
Use the token-counting and response-usage APIs in the Foundation Models framework. They let you inspect context size, count tokens, and see input, output, cached, and reasoning token usage.
When context grows too large, you have several options. You can drop old entries, drop tool calls and tool outputs once they have already been used, or summarize the conversation history.
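Two of those options, dropping used tool exchanges and keeping only recent entries, can be sketched as a plain function. `TranscriptEntry` and `trimmed` are hypothetical names, not framework types, and the sketch assumes a tool exchange counts as "used" once a later response exists.

```swift
// Hypothetical transcript model for illustration only.
struct TranscriptEntry: Equatable {
    enum Kind { case instructions, prompt, response, toolCall, toolOutput }
    let kind: Kind
    let text: String
}

/// Keeps all instructions, drops tool calls and outputs that precede the last
/// response, then keeps only the most recent `keepLast` remaining entries.
func trimmed(_ transcript: [TranscriptEntry], keepLast: Int) -> [TranscriptEntry] {
    let lastResponse = transcript.lastIndex(where: { $0.kind == .response }) ?? -1
    var instructions: [TranscriptEntry] = []
    var rest: [TranscriptEntry] = []
    for (index, entry) in transcript.enumerated() {
        switch entry.kind {
        case .instructions:
            instructions.append(entry)
        case .toolCall, .toolOutput:
            // Only keep tool traffic the model has not answered from yet.
            if index > lastResponse { rest.append(entry) }
        case .prompt, .response:
            rest.append(entry)
        }
    }
    return instructions + rest.suffix(max(0, keepLast))
}

let history = [
    TranscriptEntry(kind: .instructions, text: "You are a journaling assistant."),
    TranscriptEntry(kind: .prompt, text: "Find my notes from Monday."),
    TranscriptEntry(kind: .toolCall, text: "searchNotes(day: Monday)"),
    TranscriptEntry(kind: .toolOutput, text: "3 notes found"),
    TranscriptEntry(kind: .response, text: "You wrote three notes on Monday."),
    TranscriptEntry(kind: .prompt, text: "Summarize them.")
]

print(trimmed(history, keepLast: 2).map(\.text))
// prints ["You are a journaling assistant.", "You wrote three notes on Monday.", "Summarize them."]
```

Whether a count-based cutoff like `keepLast` is good enough, or a token-based one is needed, depends on how uneven your entries are in length.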
Apple open-sourced Foundation Models utilities, including a summarize-history modifier. It can summarize the transcript once it exceeds a configurable size. The default prompt can be overridden, which matters because the right summary depends on your use case.
There is a tradeoff. Dropping or summarizing context can invalidate the KV cache and increase latency, but keeping too much context can distract the model or hurt accuracy. Use the Evaluations framework to compare strategies on the same dataset and see which one works best for your feature.
Also avoid asking one session to do too many unrelated tasks. If tasks are independent, split them into separate sessions so each task has a fresh context window.
Foundation Models guardrails sometimes refuse emotionally intense but legitimate journal entries, such as grief or venting. Can I prevent refusals on first-person emotional writing, and how do I detect guard refusal versus other errors to fall back gracefully?
There are two separate concepts: guardrail errors and model refusals.
For input moderation, the system language model supports a setting called permissive content transformations. If enabled, the model should not error out just because the input is emotionally intense.
However, the model may still decline, in a natural-language response, to continue or elaborate on certain content. That is a model response, not a guardrail error.
If you are using structured output, the model can throw a refusal error. That is separate from a guardrail error, which comes from a moderation model checking input or output.
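For graceful fallback, the useful step is mapping each failure kind to a distinct user-facing behavior. The sketch below uses hypothetical error cases (`GenerationFailure`) and a hypothetical `Fallback` enum; the real framework's error type and case names may differ, so treat this as a shape rather than the API.

```swift
// Hypothetical error cases; the real framework's names may differ.
enum GenerationFailure: Error {
    case guardrailViolation            // moderation model blocked input or output
    case refusal(explanation: String)  // structured-output request was refused
    case rateLimited
}

// Hypothetical app-side reactions.
enum Fallback: Equatable {
    case showNeutralMessage
    case showModelExplanation(String)
    case retryLater
    case reportError
}

/// Maps a thrown error to the behavior the app should fall back to.
func fallback(for error: Error) -> Fallback {
    guard let failure = error as? GenerationFailure else {
        return .reportError
    }
    switch failure {
    case .guardrailViolation:
        return .showNeutralMessage
    case .refusal(let explanation):
        return .showModelExplanation(explanation)
    case .rateLimited:
        return .retryLater
    }
}

print(fallback(for: GenerationFailure.refusal(explanation: "I can't expand on that.")))
```

A natural-language refusal in an otherwise successful response never reaches this function, so a journaling app would still need to decide how to present such responses.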
If the behavior does not match your legitimate use case, file feedback. Apple also noted that guardrails have been improved this year to reduce false positives.
Apple has historically brought a distinct perspective to areas like design and privacy. What is your guiding philosophy or approach to AI evaluation?
Evaluation should not happen only after an AI feature is built. It should start at the beginning.
The panel described evaluation as the living specification of an AI feature: the set of things the feature should do well, edge cases it should handle, and future headroom it should grow into.
This leads to an evaluation-driven development lifecycle. You start with a small curated dataset, expand it, run the model or configuration, inspect where it succeeds and fails, and then iterate.
The Evaluations framework is designed to make that cycle easier. It supports datasets, model judges, comparison across configurations, and workflows where developers can improve the feature instead of guessing.
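The core loop of comparing configurations on one dataset can be illustrated without any framework. Everything below is hypothetical (`EvalCase`, `passRate`, and the two stand-in configurations), and it scores by exact match, which is a simplification; real AI features usually need fuzzier scoring such as model judges.

```swift
import Foundation

// Hypothetical mini-harness; this is not the Evaluations framework API.
struct EvalCase {
    let input: String
    let expected: String
}

/// Fraction of cases where `feature` produces exactly the expected output.
func passRate(of feature: (String) -> String, on dataset: [EvalCase]) -> Double {
    guard !dataset.isEmpty else { return 0 }
    let passed = dataset.filter { feature($0.input) == $0.expected }.count
    return Double(passed) / Double(dataset.count)
}

// A small curated dataset, including edge cases with stray whitespace.
let dataset = [
    EvalCase(input: "  Hello ", expected: "hello"),
    EvalCase(input: "WORLD", expected: "world"),
    EvalCase(input: "swift", expected: "swift"),
    EvalCase(input: " Mixed Case", expected: "mixed case")
]

// Two stand-in configurations of the same feature.
let configA: (String) -> String = { $0.lowercased() }
let configB: (String) -> String = {
    $0.trimmingCharacters(in: .whitespaces).lowercased()
}

print(passRate(of: configA, on: dataset)) // prints 0.5
print(passRate(of: configB, on: dataset)) // prints 1.0
```

Running both configurations against the same cases is what turns "B feels better" into a number you can track as the dataset grows.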
The larger philosophy is that AI features are non-deterministic, so validation must be part of design and development from the start.
Is it possible for models used by different apps on iPhone to be shared across apps? This could help save storage space for users.
In general, different apps cannot simply share one downloaded model across the system.
There are sandboxing, security, scheduling, and resource-management challenges. Even if two apps use what sounds like the same model, they may need different quantization, performance, quality, or memory tradeoffs based on their evaluations.
Apps from the same developer can share resources through an app group. Core AI model caching can also share resources inside a cache group when you have an app group.
But a system-wide shared model downloaded once and reused by unrelated apps is not available. If you can use the on-device Foundation Model built into the OS, that avoids increasing your app size and avoids shipping your own model weights.
🏆 Acknowledgments
A huge thank-you to everyone who joined and shared practical questions about coding intelligence, Foundation Models, Core AI, Core ML, MLX, evaluations, Vision, local models, Private Cloud Compute, Xcode agents, context management, guardrails, and model deployment.
Question acknowledgments: Jane Chow, Abi27, RB27, Desa, Indigo J, Claire Case, Pichaya_TR Yysy, Brian CM, John Lee, AO, and the online WWDC audience who submitted and upvoted the remaining questions.
Finally, a heartfelt thank-you to Shashank, Kevin, Eric, Stephen, Raziel, Angelos, and the teams behind the scenes for leading the session and explaining how Apple’s AI and machine learning tools fit together.

