Build an AI Voice Assistant with Whisper, Claude, and ElevenLabs in Python
Build a real-time voice assistant that listens, thinks, and speaks. Complete tutorial with speech-to-text, AI reasoning, and text-to-speech — all in Python.
Voice assistants are one of those projects that feel impossibly complex until you break them into three simple components:
- Speech-to-Text (STT): Convert voice to text (Whisper)
- AI Reasoning: Generate a response (Claude)
- Text-to-Speech (TTS): Convert response to voice (ElevenLabs)
Each component is an API call. The engineering challenge is gluing them together with good latency, proper audio handling, and a natural conversation flow. This tutorial gives you a fully functional voice assistant in ~200 lines of Python.
Architecture
Microphone → [Audio Buffer] → Whisper (STT) → Text
                                                ↓
                                        Claude (Reasoning)
                                                ↓
       Speaker ← ElevenLabs (TTS) ← Response Text
Total latency target: under 2 seconds from end-of-speech to start-of-response audio.
Prerequisites
pip install openai anthropic elevenlabs pyaudio numpy webrtcvad
System dependencies (Mac):
brew install portaudio ffmpeg
System dependencies (Ubuntu):
sudo apt-get install portaudio19-dev ffmpeg
Set your API keys:
export OPENAI_API_KEY="sk-..." # For Whisper API
export ANTHROPIC_API_KEY="sk-ant-..." # For Claude
export ELEVENLABS_API_KEY="..." # For TTS
Step 1: Audio Recording with Voice Activity Detection
The most important UX detail: the assistant should listen until you stop talking, then respond. We use WebRTC VAD (Voice Activity Detection) to detect when the user has finished speaking.
# src/audio.py
import io
import wave
from collections import deque

import pyaudio
import webrtcvad


class AudioRecorder:
    """Records audio with voice activity detection."""

    def __init__(
        self,
        sample_rate: int = 16000,
        frame_duration_ms: int = 30,
        padding_duration_ms: int = 500,
        vad_aggressiveness: int = 2,
    ):
        # webrtcvad only accepts 10, 20, or 30 ms frames at 8, 16, 32, or 48 kHz
        self.sample_rate = sample_rate
        self.frame_duration_ms = frame_duration_ms
        self.frame_size = int(sample_rate * frame_duration_ms / 1000)
        self.padding_frames = int(padding_duration_ms / frame_duration_ms)
        self.vad = webrtcvad.Vad(vad_aggressiveness)  # 0-3, higher = more aggressive
        self.audio = pyaudio.PyAudio()

    def record_until_silence(self, silence_threshold_ms: int = 1000) -> bytes:
        """Record audio until the user stops speaking."""
        stream = self.audio.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=self.sample_rate,
            input=True,
            frames_per_buffer=self.frame_size,
        )
        frames = []
        # Pre-roll: keep the last few pre-speech frames so the first syllable is not clipped
        pre_roll = deque(maxlen=self.padding_frames)
        silence_frames = 0
        speech_started = False
        silence_limit = int(silence_threshold_ms / self.frame_duration_ms)
        print("Listening...")
        try:
            while True:
                frame = stream.read(self.frame_size, exception_on_overflow=False)
                is_speech = self.vad.is_speech(frame, self.sample_rate)
                if is_speech:
                    if not speech_started:
                        speech_started = True
                        frames.extend(pre_roll)
                    silence_frames = 0
                    frames.append(frame)
                elif speech_started:
                    silence_frames += 1
                    frames.append(frame)  # Keep some silence at the end
                    if silence_frames >= silence_limit:
                        print("Speech ended.")
                        break
                else:
                    pre_roll.append(frame)
        finally:
            stream.stop_stream()
            stream.close()

        if not frames:
            return b""
        # Convert to WAV bytes
        return self._frames_to_wav(frames)

    def _frames_to_wav(self, frames: list[bytes]) -> bytes:
        """Convert raw audio frames to WAV format."""
        buffer = io.BytesIO()
        with wave.open(buffer, "wb") as wf:
            wf.setnchannels(1)
            wf.setsampwidth(2)  # 16-bit
            wf.setframerate(self.sample_rate)
            wf.writeframes(b"".join(frames))
        return buffer.getvalue()

    def cleanup(self):
        self.audio.terminate()
Why WebRTC VAD?
WebRTC VAD is fast (microseconds per frame), runs locally, and is battle-tested in production voice applications. It’s far more reliable than simple amplitude thresholding, which triggers on background noise.
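The end-of-speech rule used by the recorder can be checked without a microphone. The sketch below is a simplified stand-in for that loop, not the recorder itself: `frames_until_end_of_speech` is a hypothetical helper that takes one boolean per frame (what the VAD would report) and counts how many frames are kept before recording stops. It also works through the frame arithmetic for the default settings.

```python
def frames_until_end_of_speech(speech_flags, silence_limit):
    """Return how many frames are kept before recording stops, or None.

    speech_flags: iterable of booleans, one per frame (True = VAD saw speech).
    silence_limit: consecutive non-speech frames that end the utterance.
    """
    kept = 0
    silence_frames = 0
    speech_started = False
    for is_speech in speech_flags:
        if is_speech:
            speech_started = True
            silence_frames = 0
            kept += 1
        elif speech_started:
            silence_frames += 1
            kept += 1  # trailing silence is kept, as in the recorder
            if silence_frames >= silence_limit:
                return kept
        # Frames before any speech are ignored
    return None  # input ended before enough trailing silence was seen


# Frame arithmetic for the defaults: 16 kHz, 30 ms frames, 1000 ms silence threshold
frame_size = int(16000 * 30 / 1000)  # samples per frame
silence_limit = int(1000 / 30)       # silent frames that end an utterance

# Made-up VAD output: leading silence, speech, a short pause, more speech, long silence
flags = [False] * 5 + [True] * 10 + [False] * 2 + [True] * 4 + [False] * 40
kept = frames_until_end_of_speech(flags, silence_limit)
print(frame_size, silence_limit, kept)
```

The short two-frame pause does not end the utterance because any speech frame resets the silence counter. Only a full run of `silence_limit` silent frames does.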
Step 2: Speech-to-Text with Whisper
We use OpenAI’s Whisper API for transcription. It’s fast (~1 second for typical utterances), accurate, and handles accents well.
# src/stt.py
import io

from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment


def transcribe(audio_wav: bytes) -> str:
    """Transcribe WAV audio bytes to text using Whisper."""
    if not audio_wav:
        return ""

    audio_file = io.BytesIO(audio_wav)
    audio_file.name = "recording.wav"  # The file name tells the API the audio format

    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="text",  # Returns a plain string
        language="en",  # Omit this argument for auto-detection
    )
    return transcript.strip()
Local Whisper Alternative
If you want to avoid API costs and don’t mind slower processing, run Whisper locally:
# pip install openai-whisper
import os
import tempfile

import whisper

model = whisper.load_model("base")  # Options: tiny, base, small, medium, large


def transcribe_local(audio_wav: bytes) -> str:
    """Transcribe using a local Whisper model."""
    # Write to a temp file so Whisper can decode it by path (via ffmpeg)
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        f.write(audio_wav)
        temp_path = f.name
    try:
        result = model.transcribe(temp_path)
    finally:
        os.unlink(temp_path)  # delete=False leaves the file behind otherwise
    return result["text"].strip()
| Model | Parameters | Speed (RTF) | Accuracy | VRAM (approx.) |
|---|---|---|---|---|
| tiny | 39M | 0.03 | Good | 1 GB |
| base | 74M | 0.05 | Better | 1 GB |
| small | 244M | 0.12 | Great | 2 GB |
| medium | 769M | 0.25 | Excellent | 5 GB |
| large | 1.5B | 0.50 | Best | 10 GB |
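RTF = Real-Time Factor: processing time divided by audio duration. An RTF of 0.05 means 1 second of audio is processed in 0.05 seconds. These figures depend heavily on hardware, so treat them as rough relative speeds rather than guarantees.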
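To make the RTF column concrete, here is a small worked calculation. The RTF values are copied from the table above; the 5-second utterance length is an assumption chosen for illustration, and `transcription_seconds` is a hypothetical helper name.

```python
def transcription_seconds(audio_seconds: float, rtf: float) -> float:
    """Processing time = audio duration x real-time factor."""
    return audio_seconds * rtf


# RTF values from the table above (hardware dependent)
RTF_BY_MODEL = {"tiny": 0.03, "base": 0.05, "small": 0.12, "medium": 0.25, "large": 0.50}

utterance_seconds = 5.0  # assumed length of a typical spoken request
estimates = {
    name: round(transcription_seconds(utterance_seconds, rtf), 2)
    for name, rtf in RTF_BY_MODEL.items()
}
for name, seconds in estimates.items():
    print(f"{name:>6}: {seconds:.2f}s")
```

Under these assumptions only the larger models come close to eating a full second of the latency budget on their own.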
Step 3: AI Reasoning with Claude
# src/reasoning.py
import anthropic

client = anthropic.Anthropic()  # Reads ANTHROPIC_API_KEY from the environment


class ConversationManager:
    """Manages conversation history and Claude interactions."""

    def __init__(self, system_prompt: str = None):
        self.system_prompt = system_prompt or (
            "You are a helpful voice assistant. Keep responses concise: "
            "aim for 1-3 sentences unless the user asks for detail. "
            "Be conversational and natural. Avoid bullet points and "
            "markdown formatting since your responses will be spoken aloud."
        )
        self.history = []

    def get_response(self, user_text: str) -> str:
        """Get a response from Claude given user input."""
        self.history.append({
            "role": "user",
            "content": user_text,
        })
        message = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=300,  # Keep responses short for voice
            system=self.system_prompt,
            messages=self.history,
        )
        response_text = message.content[0].text
        self.history.append({
            "role": "assistant",
            "content": response_text,
        })
        # Keep history manageable (last 10 exchanges = 20 messages)
        if len(self.history) > 20:
            self.history = self.history[-20:]
        return response_text
Key Design Decision: max_tokens=300
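Voice responses need to be short. At roughly 0.75 words per token and a speaking pace of about 150 words per minute, a full 300-token response runs to around 225 words, or about 90 seconds of speech, which is far longer than most listeners will tolerate. Treat max_tokens=300 as a hard ceiling and rely on the system prompt's 1-3 sentence target to keep typical replies to a few seconds. If the user wants more detail, they’ll ask.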
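A quick way to sanity-check a token limit for voice is to convert it into speaking time. The sketch below uses two rule-of-thumb assumptions that are not from any API: about 0.75 English words per token and about 150 spoken words per minute. `speaking_seconds` is a hypothetical helper.

```python
WORDS_PER_TOKEN = 0.75  # assumption: common rule of thumb for English text
WORDS_PER_MINUTE = 150  # assumption: typical conversational speaking pace


def speaking_seconds(tokens: int) -> float:
    """Estimate how long a response of the given token count takes to speak."""
    words = tokens * WORDS_PER_TOKEN
    return words / WORDS_PER_MINUTE * 60


for tokens in (60, 150, 300):
    print(f"{tokens} tokens ~ {speaking_seconds(tokens):.0f}s of speech")
```

Under these assumptions a reply has to stay near 60 tokens to finish in under 20 seconds, which is why the system prompt asks for 1-3 sentences instead of relying on the token cap alone.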
Step 4: Text-to-Speech with ElevenLabs
# src/tts.py
from elevenlabs import ElevenLabs, play, stream

client = ElevenLabs()  # Reads ELEVENLABS_API_KEY from the environment


def speak(text: str, voice_id: str = "pNInz6obpgDQGcFmaJgB"):
    """Convert text to speech and play it."""
    audio = client.text_to_speech.convert(
        voice_id=voice_id,  # "Adam" voice; change to your preferred voice
        model_id="eleven_turbo_v2_5",  # Low-latency model
        text=text,
        output_format="mp3_22050_32",
    )
    # convert() yields byte chunks; collect them before playing
    audio_bytes = b"".join(audio)
    play(audio_bytes)


def speak_streaming(text: str, voice_id: str = "pNInz6obpgDQGcFmaJgB"):
    """Stream TTS for lower latency: playback starts before generation finishes."""
    audio_stream = client.text_to_speech.convert_as_stream(
        voice_id=voice_id,
        model_id="eleven_turbo_v2_5",
        text=text,
        output_format="mp3_22050_32",
    )
    # stream() plays chunks as they arrive (requires mpv to be installed);
    # play() would buffer the whole clip before starting playback.
    # Method names vary by SDK version: newer releases expose this call
    # as client.text_to_speech.stream(), so check your installed version.
    stream(audio_stream)
Voice Selection
ElevenLabs offers dozens of pre-made voices. Pick one that matches your use case:
# List available voices
voices = client.voices.get_all()
for voice in voices.voices:
    print(f"{voice.name}: {voice.voice_id}")
For a professional assistant, “Rachel” (calm, clear) or “Adam” (confident, natural) work well.
Local TTS Alternative: Coqui TTS
For offline/free TTS (this example loads Coqui's FastPitch model):
import subprocess

from TTS.api import TTS

tts = TTS(model_name="tts_models/en/ljspeech/fast_pitch")


def speak_local(text: str, output_path: str = "output.wav"):
    tts.tts_to_file(text=text, file_path=output_path)
    # Play with your system's audio player
    subprocess.run(["afplay", output_path])  # Mac; use "aplay" on Linux
Step 5: Putting It All Together
# main.py
import time

from src.audio import AudioRecorder
from src.reasoning import ConversationManager
from src.stt import transcribe
from src.tts import speak_streaming


def main():
    print("=" * 50)
    print("AI Voice Assistant")
    print("Speak naturally. Say 'goodbye' to exit.")
    print("=" * 50)

    recorder = AudioRecorder(vad_aggressiveness=2)
    conversation = ConversationManager()
    # Custom system prompt (optional)
    # conversation = ConversationManager(
    #     system_prompt="You are a cooking assistant..."
    # )

    try:
        while True:
            # 1. Listen
            audio_data = recorder.record_until_silence(silence_threshold_ms=1000)
            if not audio_data:
                continue

            # 2. Transcribe
            t0 = time.time()
            user_text = transcribe(audio_data)
            stt_time = time.time() - t0
            if not user_text:
                continue
            print(f"\nYou: {user_text} ({stt_time:.1f}s)")

            # Check for exit (match whole words so "quite" does not trigger "quit")
            words = {w.strip(".,!?") for w in user_text.lower().split()}
            if words & {"goodbye", "exit", "quit"}:
                speak_streaming("Goodbye! Have a great day.")
                break

            # 3. Think
            t0 = time.time()
            response = conversation.get_response(user_text)
            llm_time = time.time() - t0
            print(f"Assistant: {response} ({llm_time:.1f}s)")

            # 4. Speak
            t0 = time.time()
            speak_streaming(response)
            tts_time = time.time() - t0  # Includes playback, not just time to first audio
            print(f"[STT: {stt_time:.1f}s | LLM: {llm_time:.1f}s | TTS: {tts_time:.1f}s]")
    except KeyboardInterrupt:
        print("\nExiting...")
    finally:
        recorder.cleanup()


if __name__ == "__main__":
    main()
Latency Optimization
The out-of-the-box latency is roughly:
- STT (Whisper API): 0.5-1.0s
- LLM (Claude Sonnet): 0.8-1.5s
- TTS (ElevenLabs streaming): 0.3-0.5s start, then real-time
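- Total: 1.6-3.0 seconds after recording stops (the 1000 ms silence window that ends recording comes on top of this, so the user perceives a longer gap)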
To get under 2 seconds:
1. Stream the LLM Response to TTS
Instead of waiting for the full Claude response before starting TTS, stream the response and start TTS as soon as the first sentence is complete:
import re

import anthropic

from src.tts import speak_streaming

anthropic_client = anthropic.Anthropic()


def stream_response_to_speech(user_text: str, history: list, voice_id: str) -> str:
    """Stream Claude's response to TTS sentence by sentence for minimal latency."""
    history.append({"role": "user", "content": user_text})
    buffer = ""
    full_response = ""

    # Pass system=... here as well to reuse the voice-assistant system prompt
    with anthropic_client.messages.stream(
        model="claude-sonnet-4-20250514",
        max_tokens=300,
        messages=history,
    ) as stream:
        for text in stream.text_stream:
            buffer += text
            full_response += text
            # Check if we have a complete sentence
            if re.search(r'[.!?]\s*$', buffer):
                # Send this sentence to TTS (blocks until playback finishes)
                speak_streaming(buffer.strip(), voice_id)
                buffer = ""

    # Speak any remaining text
    if buffer.strip():
        speak_streaming(buffer.strip(), voice_id)

    history.append({"role": "assistant", "content": full_response})
    return full_response
This can reduce perceived latency to roughly 1 second: the user hears the first sentence while Claude is still generating the rest.
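Checking only whether the buffer currently ends in punctuation can fire in the middle of a number such as "3.14" when a chunk happens to end at the period. One alternative, sketched here as an assumption and not as the tutorial's method, is to split on punctuation that is followed by whitespace and hold back the unfinished tail. `split_complete_sentences` and `chunk_stream` are hypothetical helpers, and abbreviations such as "Dr." would still cause an early split.

```python
import re

# A sentence boundary: ., ! or ? followed by at least one whitespace character
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_complete_sentences(buffer: str) -> tuple[list[str], str]:
    """Split buffer into finished sentences plus a possibly unfinished remainder."""
    parts = _SENTENCE_END.split(buffer)
    # Nothing follows the last part yet, so it may still be growing
    remainder = parts.pop()
    sentences = [p.strip() for p in parts if p.strip()]
    return sentences, remainder


def chunk_stream(text_chunks):
    """Yield complete sentences from an iterable of streamed text chunks."""
    buffer = ""
    for chunk in text_chunks:
        buffer += chunk
        sentences, buffer = split_complete_sentences(buffer)
        yield from sentences
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the stream ends


# Simulated streamed chunks; in the assistant these would come from stream.text_stream
chunks = ["Pi is 3", ".14. It", " is irrational", "! Really", "?"]
spoken = list(chunk_stream(chunks))
print(spoken)
```

Each yielded sentence could then be passed to the TTS call in place of the raw buffer.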
2. Use Local Whisper for STT
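Local Whisper (base model) runs at an RTF of about 0.05 on suitable hardware, so a 5-second utterance transcribes in roughly 0.25 seconds vs. 0.5-1.0 seconds for the API. On a CPU-only machine it can be slower than the API, so measure before switching. The accuracy trade-off is minimal for clear speech.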
3. Pre-warm Connections
Keep HTTP connections alive to reduce TLS handshake overhead:
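# Use httpx with connection pooling
import httpx
from openai import OpenAI

http_client = httpx.Client(
    timeout=30.0,
    limits=httpx.Limits(max_keepalive_connections=5),
)
# Pass the client to an SDK so the pooled connections are actually reused
# (anthropic.Anthropic accepts the same http_client argument)
openai_client = OpenAI(http_client=http_client)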
Adding Wake Word Detection (Optional)
If you want “Hey Assistant” style activation instead of always-on listening:
# Using pvporcupine for wake word detection
import struct

import pvporcupine
import pyaudio

porcupine = pvporcupine.create(
    access_key="YOUR_PICOVOICE_KEY",
    keywords=["jarvis"],  # Built-in wake words
)


def listen_for_wake_word(recorder):
    """Listen continuously for the wake word; return True once it is heard."""
    stream = recorder.audio.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=porcupine.sample_rate,
        input=True,
        frames_per_buffer=porcupine.frame_length,
    )
    print("Waiting for wake word...")
    try:
        while True:
            frame = stream.read(porcupine.frame_length, exception_on_overflow=False)
            # Porcupine expects a sequence of 16-bit samples, not raw bytes
            pcm = struct.unpack_from("h" * porcupine.frame_length, frame)
            keyword_index = porcupine.process(pcm)
            if keyword_index >= 0:
                print("Wake word detected!")
                return True
    finally:
        stream.close()
Cost Analysis
Running the voice assistant for a typical day (50 interactions):
| Component | Cost per Query | Daily (50 queries) | Monthly |
|---|---|---|---|
| Whisper API | ~$0.006 | $0.30 | $9 |
| Claude Sonnet | ~$0.01 | $0.50 | $15 |
| ElevenLabs | ~$0.02 | $1.00 | $30 |
| Total | ~$0.036 | $1.80 | $54 |
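For a personal assistant, $54/month is very reasonable. For a product, this is your per-user cost baseline. The per-query figures are rough estimates: actual cost scales with audio length, token counts, and characters synthesized, so check current pricing before budgeting.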
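The table's totals can be reproduced, and re-run for other usage levels, with a few lines of arithmetic. The per-query costs are the estimates from the table above; the 30-day month is an assumption, and `cost_summary` is a hypothetical helper.

```python
# Per-query cost estimates in USD, copied from the table above
COST_PER_QUERY = {"Whisper API": 0.006, "Claude Sonnet": 0.01, "ElevenLabs": 0.02}


def cost_summary(queries_per_day: int, days_per_month: int = 30):
    """Return (per-query, daily, monthly) cost in USD."""
    per_query = sum(COST_PER_QUERY.values())
    daily = per_query * queries_per_day
    monthly = daily * days_per_month
    return round(per_query, 3), round(daily, 2), round(monthly, 2)


per_query, daily, monthly = cost_summary(50)
print(f"Per query: ${per_query:.3f} | Daily: ${daily:.2f} | Monthly: ${monthly:.2f}")
```

Changing `queries_per_day` shows how quickly the ElevenLabs share dominates, since it is the largest per-query line item.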
The Bottom Line
Building a voice assistant is three API calls in a loop: listen, think, speak. The complexity isn’t in the AI — it’s in the audio engineering (VAD, streaming, buffering) and the UX (latency optimization, natural conversation flow).
The code in this tutorial gives you a functional voice assistant in ~200 lines of Python. From here, you can add:
- Tool use (let Claude call APIs, search the web, control devices)
- Multi-language support (Whisper auto-detects language)
- Custom voice cloning (ElevenLabs voice cloning API)
- Visual interface (add a web UI with WebSocket audio streaming)
Stop asking chatbots to type. Start talking to them.