iOS 26: Foundation Models Framework - Code-Along Q&A
New educational approach from Apple
As we discovered in the previous post, the new code-along session is Apple’s fresh approach to explaining frameworks like Foundation Models. Alongside the live coding guide, a Q&A session was running, where every question—whether about the session itself or the framework in general—was answered. That’s almost unimaginable generosity from a company whose Developer Forums can sometimes take weeks to respond.
In this post, as promised, I’m sharing my questions and those from others, along with answers (and links for some of them). Everything is sorted, split into sections, and grammar-checked. No more delays—the Q&A awaits!
Disclaimer: All answers below are valid as of the publication date. Features mentioned as Beta or related to future releases may change or even be reverted in later updates.
General Usage 🦾
What is the maximum number of tokens or characters that can be sent in a prompt when using the on-device foundation model?
The current on-device model has a 4K context window (4,096 tokens total — input plus output combined).
Does the 4K context limit also apply to returned data? What if it is asked to return 10 (or 100) itineraries? Also, is there a way to get the response back as JSON (for storage or transmission)?
The context size applies to input and output tokens. For conversion to JSON, a type can be both Generable and Codable. You can then use JSONEncoder to encode the type to JSON data for storage or transmission.
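To make that concrete, here is a minimal sketch of my own (not from the session). It assumes the respond(to:generating:) overload and a made-up Recipe type; the point is simply that the same struct adopts both Generable and Codable:

```swift
import Foundation
import FoundationModels

// A type can adopt both Generable (for guided generation) and Codable
// (for persistence), so a generated value can go straight to JSON.
@Generable
struct Recipe: Codable {
    var name: String
    var ingredients: [String]
}

func generateAndEncode(in session: LanguageModelSession) async throws -> Data {
    let recipe = try await session.respond(
        to: "Invent a simple pasta recipe.",
        generating: Recipe.self
    ).content
    return try JSONEncoder().encode(recipe) // ready for storage or transmission
}
```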
How can we handle very large inputs that exceed the 4K token context limit?
You can chunk the input and process it in segments, or summarize context progressively. For large datasets, consider hybrid processing with cloud models.
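As an illustration of the chunking idea, here is a rough sketch of my own. The chunk size is a character-based heuristic, not a real token count, so treat it only as a starting point:

```swift
import FoundationModels

// Summarize a long text in chunks, then summarize the summaries.
func summarizeLongText(_ text: String, chunkSize: Int = 6_000) async throws -> String {
    var partialSummaries: [String] = []
    var remaining = Substring(text)
    while !remaining.isEmpty {
        let chunk = String(remaining.prefix(chunkSize))
        remaining = remaining.dropFirst(chunk.count)
        // A fresh session per chunk keeps each request's context small.
        let session = LanguageModelSession(instructions: "Summarize the text in two sentences.")
        partialSummaries.append(try await session.respond(to: chunk).content)
    }
    let finalSession = LanguageModelSession(instructions: "Combine these summaries into one short paragraph.")
    return try await finalSession.respond(to: partialSummaries.joined(separator: "\n")).content
}
```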
Can we limit token output to a specific number, like a maximum response length?
Yes, you can set a maximum output token count using GenerationOptions.
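For example (a sketch of my own; I'm assuming the maximumResponseTokens parameter name here):

```swift
import FoundationModels

// Cap the response length; generation stops once the limit is reached.
func shortAnswer(to prompt: String, in session: LanguageModelSession) async throws -> String {
    let options = GenerationOptions(maximumResponseTokens: 80)
    return try await session.respond(to: prompt, options: options).content
}
```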
I use guided generation with an array of strings. The results are shown in a picker intended to be suggestions in my app for next steps. I noticed the first result in the array is often similar to, or even the same as, a previous request. How can I tweak this to get more variety and different outputs?
You can try adjusting GenerationOptions like SamplingMode and temperature: https://developer.apple.com/documentation/foundationmodels/generationoptions
Why does the model return the same results even though we didn’t use any sort of seed in our code?
The sampling GenerationOption is set to .greedy. This makes the model choose the same tokens every time it's given the same input.
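Tying the last two answers together, here is a small sketch of my own: switching from the default greedy sampling to random sampling (and nudging the temperature up) is what introduces variety. I'm assuming the .random(top:) sampling-mode spelling, so double-check it against the GenerationOptions docs:

```swift
import FoundationModels

// Deterministic output: greedy sampling picks the same tokens for the same input.
let deterministic = GenerationOptions(sampling: .greedy)

// More varied output: random sampling plus a higher temperature.
let varied = GenerationOptions(
    sampling: .random(top: 50), // assumed spelling; see the GenerationOptions docs
    temperature: 1.1
)

func suggestNextStep(in session: LanguageModelSession) async throws -> String {
    try await session.respond(
        to: "Suggest one next step for planning a weekend trip.",
        options: varied
    ).content
}
```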
How much memory does the on-device model take up when loaded?
The model takes about 1.2GB of RAM once loaded into memory.
Is there any API to check how much memory or disk space the model is currently using?
There isn’t a direct API, but you can use Instruments or system diagnostics to measure memory and storage usage.
I don’t understand PartiallyGenerated — why is it needed?
Good question! This is needed if you want to stream a Generable type, because the model will generate it property by property. PartiallyGenerated turns every property optional, where nil indicates the model hasn't generated that property yet. That's especially useful for, e.g., a Bool property, because using false as a default value would be confusing.
Can we stream responses from the model as they’re generated, or do we only receive the final result?
You can stream responses using Swift’s async sequences. This allows you to display tokens or sentences as they arrive.
Does streaming structured output work the same way as streaming plain text responses?
Yes, you can stream partial structured output, and it arrives property-by-property, allowing progressive updates in your UI.
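Here's a sketch of my own that pulls the last few answers together. I'm assuming the streamResponse(to:generating:) overload and that the stream yields the PartiallyGenerated value directly; in some SDK versions you may need to read it from a snapshot's content property instead:

```swift
import FoundationModels

@Generable
struct TripSummary {
    var title: String
    var highlights: [String]
}

// Properties arrive one by one as optionals on TripSummary.PartiallyGenerated,
// so the UI can update progressively while the model is still generating.
func streamSummary(for city: String, in session: LanguageModelSession) async throws {
    let stream = session.streamResponse(
        to: "Summarize a weekend trip to \(city).",
        generating: TripSummary.self
    )
    for try await partial in stream {
        print(partial.title ?? "…", partial.highlights ?? [])
    }
}
```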
Does structured output work with nested objects and arrays, or only flat structures?
Yes, structured output supports nested Generable objects and arrays. Just make sure each type in the structure conforms to Generable.
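For instance, a nested structure might look like this (my own sketch, using @Guide descriptions):

```swift
import FoundationModels

// Every level of the structure conforms to Generable.
@Generable
struct Itinerary {
    @Guide(description: "A short, catchy title for the trip")
    var title: String

    @Guide(description: "One entry per day of the trip")
    var days: [DayPlan]
}

@Generable
struct DayPlan {
    var summary: String
    var activities: [String]
}
```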
Can we customize the on-device model, such as fine-tuning it with our own data?
No, the on-device foundation model cannot be fine-tuned. You can, however, guide its responses through system prompts, few-shot examples, and tools.
Can we combine both on-device and cloud foundation models within the same app?
Yes — you can decide dynamically whether to use the on-device or cloud model depending on the task, network availability, or privacy needs.
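A rough sketch of how that decision could look (my own code; sendToMyCloudModel is a hypothetical stand-in for whatever remote service you use):

```swift
import FoundationModels

// Placeholder for your own server-backed model call.
func sendToMyCloudModel(_ prompt: String) async throws -> String {
    // Your networking code goes here.
    ""
}

func answer(_ prompt: String) async throws -> String {
    // Prefer the on-device model when it's available; otherwise fall back.
    if case .available = SystemLanguageModel.default.availability {
        let session = LanguageModelSession()
        return try await session.respond(to: prompt).content
    } else {
        return try await sendToMyCloudModel(prompt)
    }
}
```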
Is there a way to test different models or versions side by side during development?
Yes, you can create multiple Session instances with different model identifiers and compare their outputs in the same app.
Are there any specific battery consumption considerations when using the on-device model?
Yes — running large or frequent generations can increase CPU/GPU use, so consider batching or caching results to save power.
Can the model’s responses be influenced by user preferences or stored data?
Yes, you can include user-specific context in your prompt or as tool inputs to personalize responses.
Can we expect 100% adherence in Structured Output even if the properties are PartiallyGenerated (in which case, as you mentioned before, they are marked as Swift optionals)?
Yes! Generable uses guided generation to make sure the model always outputs the correct format, and this even works with PartiallyGenerated and optionals. For more information, you can watch the Generable section of our Deep Dive video.
Can we use the foundation models framework from a Swift package, or must it be in the main app target?
You can import and use it in a Swift package as long as the package's deployment target is iOS 26 / macOS Tahoe or later.
Can we restrict the model’s output to a fixed set of strings (like “yes” or “no”)?
Yes — you can use guided generation with an enum that conforms to Generable, ensuring the model outputs only valid values.
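For example (my own sketch):

```swift
import FoundationModels

// Guided generation with an enum restricts the output to these two cases.
@Generable
enum Answer {
    case yes
    case no
}

func isVegetarian(_ recipe: String, session: LanguageModelSession) async throws -> Answer {
    try await session.respond(
        to: "Is this recipe vegetarian? \(recipe)",
        generating: Answer.self
    ).content
}
```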
Can the foundation model generate or understand structured data like JSON or XML without using Generable?
Yes, you can instruct it to output JSON or XML via prompt engineering, but Generable ensures it follows your structure reliably.
Are there any sample projects or templates available that demonstrate using tools with the on-device model?
Yes, you can find example projects in Apple’s developer documentation and WWDC session materials, particularly “Integrate Foundation Models into Your App.”
Does the model retain context between app launches, or is it reset when the app restarts?
Context is reset when the app restarts; it’s tied to the session’s lifetime, not persisted storage.
Can we provide a system prompt to guide behavior, similar to how you can in ChatGPT?
Yes — you can provide a system message at the start of your session to define context, tone, or behavior.
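As far as I can tell, that's the instructions you pass when creating the session; a quick sketch of my own:

```swift
import FoundationModels

// Instructions act like a system prompt: set once per session,
// they take precedence over the content of later prompts.
func makeAssistantSession() -> LanguageModelSession {
    LanguageModelSession(instructions: """
        You are a concise cooking assistant. \
        Answer in at most three sentences and keep a friendly tone.
        """)
}
```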
What happens if the structured output schema doesn’t match what the model returns?
If the output doesn’t match, decoding will fail gracefully — you’ll get a partial result or an error depending on your decoding logic.
Can we adjust temperature or top-p parameters for generation like we can with cloud models?
Yes, you can adjust both temperature and sampling mode using GenerationOptions.
Are there limitations when running the on-device model in the background or during multitasking?
Yes, the model is paused or unloaded when your app enters the background to conserve system resources.
Can we access token-level probabilities or confidence values from the model’s output?
Not at the moment. The on-device foundation models API does not expose per-token probabilities.
Does the model automatically handle punctuation and capitalization when generating structured output?
Yes, the model generates well-formed text and correctly formatted structured outputs by default.
Can we interrupt or cancel an in-progress generation request if the user changes their input?
Yes, you can cancel the ongoing request by calling cancel() on the Task that handles the generation.
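A small sketch of my own showing the pattern; I'm assuming cancellation surfaces as a thrown error (CancellationError or a framework error), which the catch clauses cover either way:

```swift
import FoundationModels

@MainActor
final class SuggestionController {
    private let session = LanguageModelSession()
    private var generationTask: Task<Void, Never>?

    func inputChanged(to newPrompt: String) {
        generationTask?.cancel() // abandon the stale request
        generationTask = Task {
            do {
                let response = try await session.respond(to: newPrompt)
                print(response.content)
            } catch is CancellationError {
                // Superseded by newer input; nothing to do.
            } catch {
                print("Generation failed: \(error)")
            }
        }
    }
}
```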
Can the model process or summarize audio transcripts generated from the Speech framework?
Yes, as long as you provide the transcript text as input. The foundation model itself does not process audio directly.
Does the on-device model support image or multimodal inputs?
Currently, no. The on-device foundation models support text-only input and output.
Can the on-device model generate code snippets or handle technical prompts?
Yes, but since it’s smaller than the cloud models, results may be less detailed or accurate for complex technical tasks.
How is user privacy handled when using the on-device model?
All processing occurs entirely on the device. No data is sent to Apple or external servers when using the on-device foundation model.
Tools 🛠️
Can foundation models be used in a tool (for example, to get an estimated cooking time from cooking instructions)?
A new API introduced in the current beta of iOS 26.1 also provides access to the transcript of the session from within the Tool, in case it’s useful to you. But please note that this is still in beta.
Why does this tool conform to @Observable? Is that mandatory for tools?
Not mandatory for all tools. For this specific example, we’re adding state that we want to observe from our SwiftUI View.
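For context, a tool with observable state might look roughly like this. This is my own sketch, not the session's code: the cooking-time logic is made up, I've added @MainActor so the observable state is concurrency-safe, and the return type of call(arguments:) has shifted between betas (ToolOutput versus any PromptRepresentable), so adjust it to your SDK:

```swift
import FoundationModels
import Observation

@Observable
@MainActor
final class CookingTimeTool: Tool {
    let name = "estimateCookingTime"
    let description = "Estimates the total cooking time in minutes for a recipe."

    // State a SwiftUI view could observe; not required by the Tool protocol.
    var callCount = 0

    @Generable
    struct Arguments {
        @Guide(description: "The recipe's step-by-step instructions")
        var instructions: String
    }

    func call(arguments: Arguments) async throws -> String {
        callCount += 1
        // Made-up heuristic; replace with real estimation logic.
        let minutes = max(10, arguments.instructions.count / 40)
        return "Estimated cooking time: \(minutes) minutes."
    }
}
```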
Does the context limit also include the description provided by the tool?
Yes, the description of the tool is automatically included in the prompt when passing the tool to your session.
Is a tool’s response included in the context limit?
Good question! Yes, it is.
Does using tools or structured outputs affect latency compared to plain text generation?
There’s a slight overhead, but it’s minimal — typically under 10%. The reliability of output formatting usually outweighs the cost.
Can we control how the model handles errors, such as when a tool call fails?
You can catch and handle tool execution errors via Swift’s error-handling mechanisms in your tool’s implementation.
Optimizations 📈
Why would you not want to prewarm the model?
Prewarming is useful when your app has some idle time waiting for the user to trigger the generation. Prewarming may not help if you are presenting a proactive suggestion that’s not triggered by the user.
What is the estimated efficiency of prewarm? I’ve noticed in the final app that since I know the expected UI, I can press the button and still get the same delay compared to before I included prewarm.
Calling prewarm loads the model into memory ahead of making a request. If the model is already in memory (cached from a recent previous request), prewarm won't make a noticeable difference. But when the model wasn't in memory yet, prewarm can easily save 500 ms.
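A quick sketch of the pattern (my own code): call prewarm when you can predict a request is coming, for example as soon as the user focuses the prompt field:

```swift
import FoundationModels

@MainActor
final class TripIdeasViewModel {
    private let session = LanguageModelSession(instructions: "You suggest trip ideas.")

    // Called when the user starts editing the prompt field.
    func promptFieldDidBeginEditing() {
        session.prewarm() // load the model ahead of the first respond(to:) call
    }

    func generateIdeas(for query: String) async throws -> String {
        try await session.respond(to: query).content
    }
}
```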
Does model loading or prewarming behavior differ between simulator and real devices?
Yes, the simulator does not fully simulate hardware acceleration, so load and generation times can differ significantly. For accurate profiling, always test on a real device.
If we use prewarm, can we unload or “cool down” the model later to free memory when not needed?
Currently, there’s no explicit unload API. The system manages unloading automatically when memory pressure requires it. You can infer it based on load time — if a generation suddenly takes longer, the model was likely unloaded.
Can we detect if the model is already loaded into memory, to avoid calling prewarm unnecessarily?
There’s no public API to check directly. If you call prewarm multiple times, the system will simply ignore redundant calls.
Is there a recommended way to cache or reuse a session between multiple user requests, or should we create a new one each time?
You can reuse a session for multiple related requests (to preserve context). If the requests are independent, creating a new session is fine — it helps reset the context and avoid unnecessary token buildup.
What are the main performance differences between the on-device model and the cloud model?
The on-device model is faster for short requests and provides privacy benefits, but it has a smaller context window and less raw reasoning power compared to the cloud model.
Is there any performance difference between running the same model on iPhone vs. iPad?
Performance varies slightly depending on device hardware — newer chips and more RAM provide faster responses and less latency.
How large can a structured output object be before hitting performance issues?
It depends on the complexity and context size. Larger objects mean longer generation time and higher memory use. Generally, staying under a few hundred fields is fine.
Which option/template should be used for profiling? Is there a recommended approach for debugging model behavior when the output seems inconsistent?
There is a specific template for it. You can also use any template and add the Foundation Models instrument on top. You may find the SwiftUI template or Time Profiler useful, as they let you profile the on-device LLM together with your UI.
What’s the best way to measure real-world latency in an app using foundation models?
Use the Instruments tool or your own timing metrics around model initialization and generation calls.
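For your own timing metrics, something as simple as this works (my sketch):

```swift
import FoundationModels

// Measure wall-clock latency around a single generation call.
func timedRespond(to prompt: String, in session: LanguageModelSession) async throws -> String {
    let clock = ContinuousClock()
    let start = clock.now
    let response = try await session.respond(to: prompt)
    print("Generation took \(clock.now - start)")
    return response.content
}
```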
Localization 🎤
Does the on-device model support multilingual input and output, or is it optimized for English only?
The on-device foundation model supports multiple languages, though quality may vary. It performs best with English but can handle several major languages.
Concurrency 🏎️
Can (or should) we run multiple requests in parallel to a model? What implications does that have on performance? Should we only do one request at a time, or does it work fine if we parallelize requests (and will they even parallelize)?
For a single session, you can't call the respond method again while the model is still responding. To run multiple requests in parallel, create multiple sessions.
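Here's a minimal sketch of my own of the multi-session approach (the sessions are independent, so no context is shared between them):

```swift
import FoundationModels

// Two independent sessions can respond concurrently;
// a single session handles only one request at a time.
func summarizeInParallel(_ first: String, _ second: String) async throws -> [String] {
    let sessionA = LanguageModelSession(instructions: "Summarize the text in one sentence.")
    let sessionB = LanguageModelSession(instructions: "Summarize the text in one sentence.")

    async let a = sessionA.respond(to: first)
    async let b = sessionB.respond(to: second)

    let responseA = try await a
    let responseB = try await b
    return [responseA.content, responseB.content]
}
```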
Can we chain multiple models or sessions together for multi-step reasoning or planning tasks?
Yes, you can run multiple sessions sequentially, passing results from one to the next. Each session is independent, so you’ll need to manage context manually.
Apple Review Guidelines 🕵️‍♂️
Are there any Apple guidelines for submitting an App Store app that uses a foundation model?
In addition to the App Review Guidelines, apps using the on-device foundation model are subject to the acceptable use requirements for the Foundation Models framework.
What is the consensus on Apple Intelligence-centered apps (i.e., apps that rely on it as a core part of their UI/UX)? It’s definitely always better to support all scenarios, but I’m thinking there could be cases where AI is the core part of the app.
There is a section in the HIG for Generative AI that addresses this. You can find it at https://developer.apple.com/design/human-interface-guidelines/generative-ai.
Acknowledgments 🏆
A heartfelt thank-you to everyone who participated and contributed thoughtful, insightful, and engaging questions throughout the session — your curiosity and input made this discussion both rich and collaborative.
Special thanks to:
James Dempsey, David Navalho, Joyal Serrao, Francesco Campanile, Ilian, Bruno Diniz, Bharath, Adam Ure, Pradeep Elankumaran, Bar, Ramin Firoozye, Jordy Witteman, Cristian Dinca, Dan Lee, Isaac Kim, Jin Yan, Alex Paul, Igor Dorogokuplia, Jared Hunter, Akshay, Steve Spigarelli, AJ Ram, Piotr Chojnowski, Lopes, Rajiv Jhoomuck, Jon Judelson, Nikita Korshunov, Chang Chih Hsiang, Chaitanya Kola, John Goering, Rik Visser, Maximilian Blaise Schuchart, Esteban RM, HebertGo, Abraz, Stefan Wille, Dev, Roxana Nagy, Arno Appenzeller, Amay Raj Srivastav, faraz qureshi, Momar, John Anderson, Rohith Yanapu, and Melissa Bain.
Finally, a sincere thank-you to the Apple team and moderators for leading the Code Along, sharing expert guidance, and offering clear explanations of the Foundation Models framework. Your contributions made this session an outstanding learning experience for everyone involved.

