My Use Case: A Note-Taking App with Voice Memos
In one of my long-running note-taking apps, we’ve had audio attachments for years: users record voice notes, and we transcribe them to make them searchable. For that, we’ve relied on Apple’s SFSpeechRecognizer since its introduction.
Despite its limitations (network dependence, limited language support, and a somewhat clunky streaming interface), it did the job well enough. Our team particularly appreciated the option to transcribe short audio clips offline under certain conditions. Recognition quality was high, and even the pre-concurrency API fit reasonably well into our codebase.
With iOS 26, Apple has delivered a new, modern approach to this task. That doesn’t mean you should drop the old one, not necessarily. Let’s first take a look at what we’ve had since iOS 10.
A Recap: What is SFSpeechRecognizer?
SFSpeechRecognizer is part of the original Speech framework, designed to convert spoken audio into text. It supports two primary types of requests:
File-based (SFSpeechURLRecognitionRequest): For transcribing local audio files.
Buffer-based (SFSpeechAudioBufferRecognitionRequest): For live or streamed audio, typically from the microphone (see the buffer-based sketch after the file example below).
More about the usage pipelines can be found in the official documentation.
Let’s take a look at how transcription of a file is done:
import Speech

// Creating the recognizer (Locale.current stands in for the app's own preferred-locale helper)
guard let recognizer = SFSpeechRecognizer(locale: Locale.current), recognizer.isAvailable else { return }

// Creating the request (`url` points to the local audio file to transcribe)
let request = SFSpeechURLRecognitionRequest(url: url)
if recognizer.supportsOnDeviceRecognition {
    request.requiresOnDeviceRecognition = true
}

// Setting parameters
request.taskHint = .dictation
request.shouldReportPartialResults = false
if #available(iOS 16, *) {
    request.addsPunctuation = true
}

// Gathering results
var descriptionArr: [String] = []

// Starting it!
recognizer.recognitionTask(with: request) { (result, error) in
    if let error = error { print("StT error: \(error)") }
    guard let result = result else { return }
    descriptionArr.append(result.bestTranscription.formattedString)
    if !result.isFinal { return }
    let description = descriptionArr.filter { !$0.isEmpty }.joined(separator: ". ")
    // Hand `description` back to the caller here (completion handler, continuation, etc.)
}
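For comparison, the buffer-based path mentioned above looks roughly like this. It is only a minimal sketch for live microphone input: audio session configuration, permission prompts, and error handling are omitted, and the variable names are my own.
import Speech
import AVFoundation

// A rough sketch of live transcription from the microphone
let liveRecognizer = SFSpeechRecognizer(locale: Locale.current)
let audioEngine = AVAudioEngine()
let bufferRequest = SFSpeechAudioBufferRecognitionRequest()
bufferRequest.shouldReportPartialResults = true

// Feed microphone buffers into the request as they arrive
let inputNode = audioEngine.inputNode
inputNode.installTap(onBus: 0, bufferSize: 1024, format: inputNode.outputFormat(forBus: 0)) { buffer, _ in
    bufferRequest.append(buffer)
}
try audioEngine.start()

liveRecognizer?.recognitionTask(with: bufferRequest) { result, _ in
    if let result {
        print(result.bestTranscription.formattedString)
    }
}

// When finished recording:
// audioEngine.stop()
// inputNode.removeTap(onBus: 0)
// bufferRequest.endAudio()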
It was very easy to use, but it came with several notable limitations:
Online dependency: Most languages and features require an internet connection, although some offline support is available on newer devices.
Limited configuration: You can enable partial results or hints, but you have little control over the transcription engine.
Locale constraints: You must check if a language is supported via SFSpeechRecognizer.supportedLocales(), and there is no partial loading.
Audio duration limit: Apple recommends keeping audio files under 1 minute for optimal performance. Although transcription may still work beyond that, behavior is undocumented and unreliable, especially for buffer-based requests, which are subject to stricter limits (~1 minute) and may fail silently or return partial results.
These constraints made it difficult to build robust transcription pipelines for longer content or offline-first workflows.
Meet the Future: SpeechAnalyzer
iOS 26 introduces a major upgrade to Apple’s speech technology: the SpeechAnalyzer class. Designed for performance, flexibility, and full offline operation, it brings a modular, concurrency-friendly API for everything from dictation to custom model management.
SpeechAnalyzer is not a single monolithic class, but a collection of tools and components organized under the Speech module. The class allows developers to:
Transcribe audio files or streams
Detect when speech starts and stops
Use specialized models for different transcription styles (e.g., dictation vs. command-based)
Download and manage on-device speech models for specific locales
This modularity makes it far more powerful and scalable than SFSpeechRecognizer, especially when working offline or with longer audio content. Let’s take a look at what we have.
Disclaimer! At the time of publishing, all documentation was at the Beta 3 stage, so things may change and not work as expected. Later this September (after the GM release) I will check the logic again.
At the heart of the logic are:
SpeechAnalyzer
Your main entry point—used to perform transcriptions and other analysis on local audio data.
How to start analysis: you can use analyzeSequence(_:) or analyzeSequence(from:) for most transcription tasks, especially when leveraging Swift’s structured concurrency.
SpeechAnalyzer also supports autonomous, self-managed analysis. In this mode, the analyzer operates continuously in its own task as audio input becomes available. To enable this behavior, initialize the analyzer using a constructor that accepts an input sequence or file, or start it explicitly using start(inputSequence:) or start(inputAudioFile:finishAfterFile:).
If you want the analysis to complete once the input finishes, call finalizeAndFinishThroughEndOfInput(). To begin analyzing a new input after finishing the previous one, simply call one of the start methods again.
States: There are several ways an analysis session can be completed. Once a session is finished, the analyzer stops accepting new input and can no longer be reconfigured with different input sources or modules. Most operations on the analyzer will become inert at that point. Each module’s result stream will also close—no new results will be published, though you can continue to process any results that were already emitted.
It’s important to note that ending the input stream (for example, by calling AsyncStream.Continuation.finish()) does not automatically end the analysis session. The analyzer remains active and ready to process new input. If your intention is to fully complete the session, you must explicitly call one of the analyzer’s finish methods to do so.
Source of data: it accepts a modern AsyncStream or a URL to an audio file, much like its predecessor, and we still need to specify what kind of task we are performing.
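To make the autonomous mode a bit more concrete, here is a minimal sketch of feeding an AsyncStream of audio into the analyzer. The AnalyzerInput wrapper and the exact start(inputSequence:) signature are my reading of the Beta 3 API, so treat this as a sketch rather than final code.
import Speech
import AVFoundation

func streamingAnalysis(transcriber: SpeechTranscriber) async throws {
    // Build an input sequence the analyzer can consume autonomously.
    // AnalyzerInput wrapping AVAudioPCMBuffer is assumed here.
    let (inputSequence, continuation) = AsyncStream.makeStream(of: AnalyzerInput.self)

    let analyzer = SpeechAnalyzer(modules: [transcriber])

    // Autonomous mode: the analyzer processes input in its own task as it arrives.
    try await analyzer.start(inputSequence: inputSequence)

    // Elsewhere, as microphone buffers arrive:
    // continuation.yield(AnalyzerInput(buffer: pcmBuffer))

    // Ending the stream alone does not finish the session...
    continuation.finish()

    // ...so explicitly finish once all queued input has been analyzed.
    try await analyzer.finalizeAndFinishThroughEndOfInput()
}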
SpeechModules
This is our modern, boosted replacement for SFSpeechRecognitionRequest: a namespace-like construct that offers three different engines:
DictationTranscriber: For natural, punctuation-aware dictation.
SpeechTranscriber: For clean speech-to-text, ideal for commands and simpler processing.
SpeechDetector: For detecting speech presence and timing without full transcription.
All of them conform to the SpeechModule protocol, which provides result fetching, and can be initialized either with an extended options initializer or with a preset. The available options vary per module. This is an example of a SpeechTranscriber init:
func transcriber(for locale: Locale) -> SpeechTranscriber {
    SpeechTranscriber(locale: locale,
                      transcriptionOptions: [],
                      reportingOptions: [.volatileResults],
                      attributeOptions: [.audioTimeRange])
}
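Alternatively, a module can be created from a preset, which bundles a set of defaults. A minimal sketch using the same .offlineTranscription preset that appears later in the tutorial:
// Preset-based initialization: a bundle of sensible defaults for a given use case.
let transcriber = SpeechTranscriber(locale: Locale(identifier: "it-IT"),
                                    preset: .offlineTranscription)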
Types of the Speech Module
The Speech module introduces three specialized engines for different types of speech processing:
DictationTranscriber
Optimized for free-form natural speech, this transcriber adds punctuation and understands more conversational structure—ideal for composing messages, emails, or taking long-form notes.
SpeechTranscriber
Designed for simpler, more structured recognition tasks, such as transcribing short commands, prompts, or keyword-based queries where minimal formatting is required.
SpeechDetector
This tool detects the presence and duration of speech without converting it to text. It's particularly useful for segmenting long recordings, identifying active voice regions, or triggering voice-activated actions.
When to Use Each Tool
Use DictationTranscriber when you want a full transcription with punctuation, sentence structure, and conversational formatting, like writing a document or note.
Use SpeechTranscriber when you only care about raw words and want minimal formatting, for example in command recognition or keyword search.
Use SpeechDetector when you need to know if and when speech is present, without needing the actual content. This is useful for audio analysis, trimming, or indexing long recordings.
Let’s put this into practice and transcribe some real audio…
AssetInventory
A new class in the framework. It handles language models and downloadable assets. Finally, you can programmatically check and preload language packs for offline use, such as German, Spanish, or Japanese.
Another disclaimer! Currently SpeechTranscriber.supportedLocales returns [] (an empty array).
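If you want to see what your particular beta build reports, a quick throwaway check like this prints both lists:
import Speech

func dumpAvailableLocales() async {
    // Quick sanity check of what the current OS build actually reports.
    let supported = await SpeechTranscriber.supportedLocales
    let installed = await SpeechTranscriber.installedLocales
    print("Supported: \(supported.map { $0.identifier(.bcp47) })")
    print("Installed: \(installed.map { $0.identifier(.bcp47) })")
}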
Now let’s return to the modules and examine what we have in detail.
Tutorial: Offline Transcription of an Italian Audio File
Let’s walk through a practical example: transcribing a local Italian .mp3 file using the new API, all with modern Swift Concurrency. I used the cool ElevenLabs service to generate audio from this text:
Nell'antica terra di Eldoria, dove i cieli scintillavano e le foreste sussurravano segreti al vento, viveva un drago di nome Zephyros. [sarcastically] Non il tipo che “brucia tutto... [giggles] ma era gentile, saggio, con occhi come stelle antiche. [whispers] Perfino gli uccelli tacevano quando passava.
Who doesn’t like dragons?!
Step 1: Check & Download the Italian Language Asset
Apple’s sample code from WWDC provides convenient methods to ensure that the assets are available:
Check that the locale is supported at all
Check whether it is already installed
Download it if it is not installed
public func ensureModel(transcriber: SpeechTranscriber, locale: Locale) async throws {
    guard await supported(locale: locale) else {
        throw NSError(domain: "SpeechAnalyzerExample", code: 1, userInfo: [NSLocalizedDescriptionKey: "Locale not supported"])
    }
    if await installed(locale: locale) {
        return
    } else {
        try await downloadIfNeeded(for: transcriber)
    }
}

func supported(locale: Locale) async -> Bool {
    let supported = await SpeechTranscriber.supportedLocales
    return supported.map { $0.identifier(.bcp47) }.contains(locale.identifier(.bcp47))
}

func installed(locale: Locale) async -> Bool {
    let installed = await Set(SpeechTranscriber.installedLocales)
    return installed.map { $0.identifier(.bcp47) }.contains(locale.identifier(.bcp47))
}

func downloadIfNeeded(for module: SpeechTranscriber) async throws {
    if let downloader = try await AssetInventory.assetInstallationRequest(supporting: [module]) {
        try await downloader.downloadAndInstall()
    }
}
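In practice you call this helper right before creating the analyzer, as Step 3 does. A standalone call (a hypothetical call site, inside an async context) might look like this:
let locale = Locale(identifier: "it-IT")
let transcriber = transcriber(for: locale)
try await ensureModel(transcriber: transcriber, locale: locale)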
Step 2: Create and Configure a SpeechTranscriber
func transcriber(for locale: Locale) -> SpeechTranscriber {
    SpeechTranscriber(locale: locale, preset: .offlineTranscription)
}
The offline preset is used since my initial implementation was designed to work offline, and the new transcription should work the same way.
Step 3: Perform Transcription on a Local Audio File
func transcribeFile() async throws -> String {
    let fileUrl = try loadAudioFile(named: "dragon_sample", withExtension: "mp3")
    //Setting locale
    let locale = Locale(identifier: "it-IT")
    //Creating Transcriber Module
    let transcriber = transcriber(for: locale)
    //The result Task which will be triggered
    async let transcriptionResult = try transcriber.results
        .reduce("") { str, result in str + String(result.text.characters) }
    //Checking Assets
    try await ensureModel(transcriber: transcriber, locale: locale)
    //Now finally, our Analyzer
    let analyzer = SpeechAnalyzer(modules: [transcriber])
    if let lastSample = try await analyzer.analyzeSequence(from: AVAudioFile(forReading: fileUrl)) {
        try await analyzer.finalizeAndFinish(through: lastSample)
    } else {
        await analyzer.cancelAndFinishNow()
    }
    return try await transcriptionResult
}

//Helper method to load file from Bundle
func loadAudioFile(named name: String, withExtension ext: String) throws -> URL {
    guard let url = Bundle.main.url(forResource: name, withExtension: ext) else {
        throw NSError(domain: "SpeechAnalyzer", code: 1, userInfo: [NSLocalizedDescriptionKey: "Audio file not found in bundle."])
    }
    return url
}
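Here is how you might call it, for example from a button action (a hypothetical call site, not part of the original sample):
Task {
    do {
        let transcript = try await transcribeFile()
        print("Transcript: \(transcript)")
    } catch {
        print("Transcription failed: \(error)")
    }
}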
Why are we not getting results from the analyzer itself? Previously, we had a single call, recognitionTask(with:resultHandler:), or its delegate-based counterpart. The thing is that setModules(_:) now accepts an array of modules.
Modules can be added to or removed from the analyzer mid-stream. A newly added module will immediately begin analysis on new audio input, but it will not have access to already-analyzed audio. However, you may keep a copy of previously analyzed audio and provide it to a separate analyzer.
That’s why we get the results separately for each module, and each of them constructs its own results stream. Modules can even be added later for a specific audio task!
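To illustrate the per-module result streams, here is a minimal sketch that consumes a transcriber’s results in its own task, independent of whoever drives the analyzer. It relies only on the transcriber.results sequence already shown above.
// Each module publishes its own results stream; consume it in its own task.
let resultsTask = Task {
    var transcript = ""
    for try await result in transcriber.results {
        transcript += String(result.text.characters)
    }
    return transcript
}
// Drive the analyzer elsewhere; when the session finishes, the stream closes:
// let text = try await resultsTask.value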
Final Thoughts
With SpeechAnalyzer, Apple has brought its speech tooling into the modern era: offline-first, model-driven, and deeply integrated with Swift Concurrency. Whether you're building voice-based command apps, dictation utilities, or accessibility tools, this new class opens up massive opportunities with minimal friction.