I was deep in a planning session, trying to document a complex feature implementation with Claude Code. The ideas were flowing fast, but the transcription software I was using felt like watching paint dry. Every time I recorded something, I'd wait 5+ seconds to see what I'd actually said. The conversation had moved on by the time I got my transcription back.
Sound familiar? That lag kills momentum. When you're in the zone, waiting for batch processing feels archaic. You lose your train of thought. The mental model dissolves. And don't get me started on existing transcription software that makes you choose between privacy (local offline models) and accuracy (cloud processing with inevitable latency).
I hit my breaking point not because I wanted to build another transcription app, but because I needed something that didn't exist.
The Epiphany Moment with Real-Time Audio
Here's what changed everything: while experimenting with different audio processing approaches, I integrated FluidAudio's CoreML models - specifically their StreamingEouAsrManager with Parakeet TDT v3 running on Apple Neural Engine. The results were immediate and compelling. Text started appearing almost instantly as I spoke, with end-of-utterance detection so precise it felt psychic.
Apple's Neural Engine isn't just marketing fluff - when you run audio models locally, you get incredible performance with minimal power consumption. Parakeet TDT v3 processes audio far faster than real time (roughly a 190× speed factor), giving me a glimpse of what transcription could be: immediate, fluid, conversational.
This was the peek into the future I wanted. But local models, while fast, sometimes trade accuracy for speed. I needed cloud-level accuracy with local-level responsiveness. The traditional approach would say "pick one," but I needed both.
The Technical Architecture I Built
To deliver streaming transcription that feels local but provides cloud accuracy, I built a dual-streaming architecture:
Local Streaming with FluidAudio: For offline functionality and maximum privacy, I integrated FluidAudio's CoreML models. The Parakeet TDT v3 model runs natively on the Apple Neural Engine, delivering real-time processing offline at that same roughly 190× speed factor. Your audio never leaves your device, and you get instant feedback.
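To make that concrete, here's a minimal sketch of how the local path can be wired: a standard AVAudioEngine tap feeding a streaming transcriber. The `LocalStreamingTranscriber` protocol and its method names are placeholders standing in for FluidAudio's streaming manager, not the library's actual API.

```swift
import AVFoundation

// Placeholder protocol standing in for FluidAudio's streaming manager --
// the names here are illustrative, not the library's real API.
protocol LocalStreamingTranscriber {
    /// Push a chunk of PCM samples into the model.
    func feed(_ samples: [Float]) async
    /// Partial text hypotheses as they form.
    var partialResults: AsyncStream<String> { get }
}

final class LocalStreamingPipeline {
    private let engine = AVAudioEngine()
    private let transcriber: LocalStreamingTranscriber

    init(transcriber: LocalStreamingTranscriber) {
        self.transcriber = transcriber
    }

    func start() throws {
        let input = engine.inputNode
        let format = input.outputFormat(forBus: 0)

        // Small buffers keep end-to-end latency low; the audio never leaves the device.
        input.installTap(onBus: 0, bufferSize: 2048, format: format) { buffer, _ in
            guard let channel = buffer.floatChannelData?[0] else { return }
            let samples = Array(UnsafeBufferPointer(start: channel,
                                                    count: Int(buffer.frameLength)))
            Task { await self.transcriber.feed(samples) }
        }
        try engine.start()
    }

    /// Forward partial hypotheses to the preview window as they arrive.
    func observePartials(_ onPartial: @escaping (String) -> Void) {
        Task {
            for await text in self.transcriber.partialResults {
                onPartial(text)
            }
        }
    }
}
```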
Cloud Streaming with Cloudflare Durable Objects: For when you need state-of-the-art accuracy, I built a streaming proxy using Cloudflare Workers. Each user gets their own Durable Object session with dedicated WebSocket connections. These persistent sessions maintain state across hibernation, ensuring uninterrupted streaming even during network interruptions. Audio streams directly to models in the cloud, and transcription streams back instantly.
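The client side of the cloud path is essentially a persistent WebSocket that pushes audio frames up and reads transcript messages back. Here's a rough sketch using URLSessionWebSocketTask; the endpoint URL and message format are assumptions for illustration, not Avaan's actual wire protocol.

```swift
import Foundation

// Minimal cloud-streaming client sketch. The URL and the transcript payload
// shape are assumptions for illustration, not Avaan's real protocol.
final class CloudStreamingSession {
    private var task: URLSessionWebSocketTask?

    func connect() {
        // Each user session maps to one Durable Object behind a Worker route (assumed path).
        let url = URL(string: "wss://stream.example.workers.dev/session")!
        task = URLSession.shared.webSocketTask(with: url)
        task?.resume()
        receiveLoop()
    }

    /// Send a chunk of encoded audio upstream as a binary frame.
    func send(audioChunk: Data) {
        task?.send(.data(audioChunk)) { error in
            if let error { print("send failed: \(error)") }
        }
    }

    /// Keep reading transcript messages as they stream back.
    private func receiveLoop() {
        task?.receive { [weak self] result in
            if case .success(let message) = result {
                if case .string(let text) = message {
                    print("partial transcript: \(text)")
                }
                self?.receiveLoop()   // re-arm for the next message
            }
        }
    }
}
```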
The key insight: Both approaches deliver the same streaming experience. Whether you're using local models or cloud processing, you see text appear as you speak.
How the streaming works:
- Local V2: Real-time English streaming through Apple Neural Engine
- Local V3: Batch processing for 25+ languages (no real-time streaming)
- Cloud Models: Multilingual streaming delivered through WebSocket sessions
Technical foundation: FluidAudio CoreML for local models, Cloudflare Workers + Durable Objects for cloud streaming with state-of-the-art accuracy models.
The streaming architecture handles:
- Real-time audio chunking (audio flows while connection establishes)
- End-of-utterance detection with WebSocket persistence
- Model-specific latency optimization
- Automatic connection recovery during streaming
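To make the chunking and recovery items concrete, here's one way the client-side buffering could work: audio keeps flowing into a local queue, and queued chunks flush the moment the socket is (re)established. The types and thresholds are illustrative, and the sketch skips thread safety.

```swift
import Foundation

// Sketch of "audio flows while the connection establishes": chunks queue locally
// and flush once the socket is up. Thresholds and types are illustrative.
final class BufferedAudioSender {
    private var pending: [Data] = []
    private var isConnected = false
    private let maxBuffered = 256          // cap memory if the network stays down
    private let send: (Data) -> Void       // e.g. CloudStreamingSession.send(audioChunk:)

    init(send: @escaping (Data) -> Void) {
        self.send = send
    }

    func enqueue(_ chunk: Data) {
        if isConnected {
            send(chunk)
        } else {
            pending.append(chunk)
            if pending.count > maxBuffered { pending.removeFirst() }  // drop oldest audio
        }
    }

    /// Called when the WebSocket (re)connects: drain everything recorded meanwhile.
    func connectionEstablished() {
        isConnected = true
        pending.forEach(send)
        pending.removeAll()
    }

    /// Called on disconnect: keep recording, start buffering again.
    func connectionLost() {
        isConnected = false
    }
}
```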
Most transcription software treats real-time as a novelty. I built streaming as the foundation.
What Makes Streaming Different From Real-Time
Here's where most people get confused: "real-time" transcription usually means "faster batch processing." True streaming means you see partial transcription as you speak, not after you're done speaking.
Traditional approach: Record → Upload → Wait → Download → Read
Streaming approach: Speak → See text appear → Continue speaking → See corrections
I implemented a 140px-tall live preview window that updates continuously as you speak. You watch your words appear and immediately see patterns: "Wait, I said 'their' not 'there'" or "That's not what I meant" or "Let me rephrase that."
The preview window is the game-changer. At ~300-500ms latency, you're editing in real time instead of reviewing later. Combined with end-of-utterance detection (the visual green glow when you pause), you get conversational feedback loops instead of archival processing.
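If you're curious what that preview surface amounts to, here's a simplified SwiftUI stand-in: partial hypotheses stream into a published property and render immediately. This is a sketch of the idea, not Avaan's actual view code.

```swift
import SwiftUI

// Simplified stand-in for the live preview strip: partial hypotheses stream into
// `transcript` and render immediately, so you edit while speaking instead of after.
final class PreviewModel: ObservableObject {
    @Published var transcript = ""
}

struct LivePreviewWindow: View {
    @ObservedObject var model: PreviewModel

    var body: some View {
        ScrollView {
            Text(model.transcript)
                .frame(maxWidth: .infinity, alignment: .leading)
                .padding(8)
        }
        .frame(height: 140)  // fixed-height preview strip, as described above
    }
}
```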
Streaming Reality: V2 delivers English streaming at ~300-500ms latency, while V3 processes 25+ languages in batch mode. Cloud models (Pro tier) provide multilingual streaming with state-of-the-art accuracy.
Built for Getting Work Done
I didn't build Avaan for demos. I built it because I needed transcription that fit my workflow instead of interrupting it.
Menu Bar Architecture: Avaan lives in your menu bar (no dock icon), accessible via Cmd+Shift+Space from anywhere. The floating recording window stays out of your way but close enough to monitor.
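The menu bar setup itself is standard AppKit. Here's a minimal sketch; the hotkey handling is deliberately simplified, since a production app would register a proper global hotkey rather than relying on an event monitor.

```swift
import AppKit

// Minimal menu-bar skeleton: no dock icon (LSUIElement set in Info.plist),
// a status item, and a Cmd+Shift+Space listener that toggles recording.
final class MenuBarController {
    private let statusItem = NSStatusBar.system.statusItem(withLength: NSStatusItem.squareLength)
    private var hotkeyMonitor: Any?

    func setUp(toggleRecording: @escaping () -> Void) {
        statusItem.button?.title = "🎙"

        // Simplified: a global monitor needs Accessibility permission and cannot
        // consume the keystroke; a real app would use a dedicated hotkey API.
        hotkeyMonitor = NSEvent.addGlobalMonitorForEvents(matching: .keyDown) { event in
            if event.modifierFlags.contains([.command, .shift]), event.keyCode == 49 {  // 49 = space
                toggleRecording()
            }
        }
    }
}
```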
Context-Aware Transcription: six modes, each optimized for a different use case:
- Auto Mode: Detects what you're working on
- Notes Mode: Perfect for meetings and planning
- Email Mode: Polished communication formatting
- Chat Mode: Casual conversation flow
- Code Mode: Technical content with syntax awareness
- Off Mode: Just record, no processing
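Internally, a mode selector like this can reduce to a small enum that picks a post-processing hint per mode. The names below mirror the list above, but the structure is an illustrative guess rather than Avaan's real code.

```swift
// Illustrative mapping from mode to post-processing behavior; the case names
// mirror the list above, but the structure is a guess, not Avaan's actual code.
enum TranscriptionMode: String, CaseIterable {
    case auto, notes, email, chat, code, off

    /// Hint handed to the formatting pass once streaming finishes.
    var formattingHint: String? {
        switch self {
        case .auto:  return nil                               // inferred from the frontmost app
        case .notes: return "structured meeting notes"
        case .email: return "polished email prose"
        case .chat:  return "casual conversational tone"
        case .code:  return "preserve identifiers and syntax"
        case .off:   return nil                               // raw transcript, no processing
        }
    }
}
```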
Privacy by Design: With local models, your audio never leaves your device. With cloud streaming, audio travels over an encrypted connection and nothing is stored permanently - it's processed, delivered back to you in real time, then discarded. You stay in control of your privacy trade-offs.
Visual Feedback: Real-time visual cues show exactly what's happening. Recording in progress: red glow around the window. End of utterance detected: green glow. Successful paste: minimal "Pasted ✓" indicator.
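Those cues boil down to a tiny state machine. A hedged sketch of how the states might map onto the colors described above:

```swift
import SwiftUI

// The cues above as a small state machine; the exact mapping is illustrative.
enum FeedbackState {
    case idle, recording, utteranceEnded, pasted

    /// Glow color drawn around the floating window for each state.
    var glow: Color? {
        switch self {
        case .idle:           return nil
        case .recording:      return .red     // recording in progress
        case .utteranceEnded: return .green   // end of utterance detected
        case .pasted:         return nil      // show the minimal "Pasted ✓" label instead
        }
    }
}
```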
Most software gives you "faster" processing. I built fluent, conversational interaction.
Available Now
I built Avaan because I needed real-time transcription that didn't interrupt my flow. The streaming preview window changes how you think about audio documentation - from "record and review later" to "see and adjust as you speak." Download at avaan.app/download →
Current model lineup:
- Free tier: V2 for English streaming and V3 for multilingual processing (batch mode)
- Pro tier: A1 Ultra for multilingual streaming plus A1/A1 Pro for enhanced accuracy with state-of-the-art cloud models
- Both tiers: Streaming preview window, end-of-utterance detection, all transcription modes
The streaming architecture supports multiple languages with automatic language detection. Local processing works offline, with no network latency at all. Cloud processing streams encrypted audio and returns results in under a second.
I've used Avaan daily for the past month for feature planning, meeting notes, daily standups, and technical documentation. The live preview window means I catch transcription errors immediately rather than discovering them days later. The end-of-utterance detection gives me confidence in where sentences start and end.
Download Avaan at avaan.app/download → Your words deserve real-time transcription, and I'd love to hear what you build with fluent, conversational streaming.
Most importantly - no more waiting for your thoughts to appear on screen. Real-time transcription is ready when you are.
P.S. Want to dive deeper into the technical implementation? Check out FluidAudio on GitHub to explore the CoreML streaming libraries that power Avaan's local transcription.