Build an AI Voice Assistant with Whisper, Claude, and ElevenLabs in Python

Voice assistants are one of those projects that feel impossibly complex until you break them into three simple components:

Speech-to-Text (STT): Convert voice to text (Whisper)
AI Reasoning: Generate a response (Claude)
Text-to-Speech (TTS): Convert response to voice (ElevenLabs)

Each component is an API call. The engineering challenge is gluing them together with good latency, proper audio handling, and a natural conversation flow. This tutorial gives you a fully functional voice assistant in ~200 lines of Python.

Architecture

Microphone → [Audio Buffer] → Whisper (STT) → Text
                                                ↓
                                         Claude (Reasoning)
                                                ↓
                              Speaker ← ElevenLabs (TTS) ← Response Text

Total latency target: under 2 seconds from end-of-speech to start-of-response audio.
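
One way to read the diagram is as a single function that passes data through four stages. The sketch below is illustrative only: `run_turn` and its four callable parameters are hypothetical names, not part of any library, and the real stages are built in Steps 1-4. Passing the stages in as callables keeps each one swappable (API vs. local) and lets the loop run without a microphone or API keys.

```python
from typing import Callable, Optional


def run_turn(
    record: Callable[[], bytes],
    transcribe: Callable[[bytes], str],
    respond: Callable[[str], str],
    speak: Callable[[str], None],
) -> Optional[str]:
    """Run one listen -> transcribe -> respond -> speak cycle.

    Returns the assistant's reply, or None if nothing usable was heard.
    """
    audio = record()
    if not audio:
        return None

    user_text = transcribe(audio)
    if not user_text:
        return None

    reply = respond(user_text)
    speak(reply)
    return reply


# Usage with stand-in stages (hypothetical stubs, no hardware needed)
spoken = []
reply = run_turn(
    record=lambda: b"fake-audio",
    transcribe=lambda audio: "what time is it",
    respond=lambda text: f"You asked: {text}",
    speak=spoken.append,
)
```

The early returns mirror the two places where a turn can legitimately produce nothing: an empty recording and an empty transcript.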

Prerequisites

pip install openai anthropic elevenlabs pyaudio numpy webrtcvad

System dependencies (Mac):

brew install portaudio ffmpeg

System dependencies (Ubuntu):

sudo apt-get install portaudio19-dev ffmpeg

Set your API keys:

export OPENAI_API_KEY="sk-..."      # For Whisper API
export ANTHROPIC_API_KEY="sk-ant-..." # For Claude
export ELEVENLABS_API_KEY="..."      # For TTS
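
A missing key otherwise surfaces only when the first API call fails mid-conversation, so it can help to check at startup. This is a minimal sketch using only the standard library; `missing_api_keys` and `require_api_keys` are hypothetical helper names, and the variable names are the three exported above.

```python
import os
from typing import List, Mapping

REQUIRED_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "ELEVENLABS_API_KEY")


def missing_api_keys(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required keys that are unset or blank."""
    return [name for name in REQUIRED_KEYS if not env.get(name, "").strip()]


def require_api_keys() -> None:
    """Exit with a readable message if any key is missing."""
    missing = missing_api_keys()
    if missing:
        raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
```

Calling `require_api_keys()` as the first line of `main()` would be one place to use it.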

Step 1: Audio Recording with Voice Activity Detection

The most important UX detail: the assistant should listen until you stop talking, then respond. We use WebRTC VAD (Voice Activity Detection) to detect when the user has finished speaking.

# src/audio.py
import pyaudio
import webrtcvad
import numpy as np
from collections import deque
import wave
import io


class AudioRecorder:
    """Records audio with voice activity detection."""

    def __init__(
        self,
        sample_rate: int = 16000,
        frame_duration_ms: int = 30,
        padding_duration_ms: int = 500,
        vad_aggressiveness: int = 2,
    ):
        self.sample_rate = sample_rate
        self.frame_duration_ms = frame_duration_ms
        self.frame_size = int(sample_rate * frame_duration_ms / 1000)
        # Frames of pre-speech padding; computed here but not yet used by record_until_silence
        self.padding_frames = int(padding_duration_ms / frame_duration_ms)
        self.vad = webrtcvad.Vad(vad_aggressiveness)  # 0-3, higher = more aggressive

        self.audio = pyaudio.PyAudio()

    def record_until_silence(self, silence_threshold_ms: int = 1000) -> bytes:
        """Record audio until the user stops speaking."""
        stream = self.audio.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=self.sample_rate,
            input=True,
            frames_per_buffer=self.frame_size,
        )

        frames = []
        silence_frames = 0
        speech_started = False
        silence_limit = int(silence_threshold_ms / self.frame_duration_ms)

        print("Listening...")

        try:
            while True:
                frame = stream.read(self.frame_size, exception_on_overflow=False)
                is_speech = self.vad.is_speech(frame, self.sample_rate)

                if is_speech:
                    speech_started = True
                    silence_frames = 0
                    frames.append(frame)
                elif speech_started:
                    silence_frames += 1
                    frames.append(frame)  # Keep some silence at the end

                    if silence_frames >= silence_limit:
                        print("Speech ended.")
                        break
        finally:
            stream.stop_stream()
            stream.close()

        if not frames:
            return b""

        # Convert to WAV bytes
        return self._frames_to_wav(frames)

    def _frames_to_wav(self, frames: list[bytes]) -> bytes:
        """Convert raw audio frames to WAV format."""
        buffer = io.BytesIO()
        with wave.open(buffer, "wb") as wf:
            wf.setnchannels(1)
            wf.setsampwidth(2)  # 16-bit
            wf.setframerate(self.sample_rate)
            wf.writeframes(b"".join(frames))
        return buffer.getvalue()

    def cleanup(self):
        self.audio.terminate()
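
The recorder computes `padding_frames` and imports `deque` but never uses either. One plausible use, sketched here as an assumption rather than the author's design, is a pre-roll ring buffer: keep the last few non-speech frames so the start of the first word is not clipped when the VAD fires slightly late. `collect_utterance` and the `is_speech` callable are hypothetical; in the real recorder `is_speech` would wrap `self.vad.is_speech(frame, self.sample_rate)`.

```python
from collections import deque
from typing import Callable, Iterable, List


def collect_utterance(
    frames: Iterable[bytes],
    is_speech: Callable[[bytes], bool],
    padding_frames: int = 16,
    silence_limit: int = 33,
) -> List[bytes]:
    """Collect one utterance, including a short pre-roll before speech starts."""
    pre_roll = deque(maxlen=padding_frames)  # most recent non-speech frames
    voiced: List[bytes] = []
    silence_frames = 0
    speech_started = False

    for frame in frames:
        speech = is_speech(frame)

        if not speech_started:
            if speech:
                speech_started = True
                voiced.extend(pre_roll)  # prepend the buffered lead-in
                pre_roll.clear()
                voiced.append(frame)
            else:
                pre_roll.append(frame)  # oldest frame falls off automatically
            continue

        voiced.append(frame)
        silence_frames = 0 if speech else silence_frames + 1
        if silence_frames >= silence_limit:
            break  # enough trailing silence: the utterance is over

    return voiced
```

With 30 ms frames, the defaults correspond to roughly 500 ms of pre-roll and 1 second of trailing silence, matching the recorder's default parameters.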

Why WebRTC VAD?

WebRTC VAD is fast (microseconds per frame), runs locally, and is battle-tested in production voice applications. It’s far more reliable than simple amplitude thresholding, which triggers on background noise.
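
WebRTC VAD is strict about its input: it expects 16-bit mono PCM at 8000, 16000, 32000, or 48000 Hz, in frames of exactly 10, 20, or 30 ms. The recorder's defaults (16 kHz, 30 ms) satisfy this, but a different sample rate such as 44100 Hz would not. The helper below is a small sketch for validating settings before opening the stream; `vad_frame_bytes` is a hypothetical name.

```python
VALID_SAMPLE_RATES = (8000, 16000, 32000, 48000)
VALID_FRAME_MS = (10, 20, 30)
BYTES_PER_SAMPLE = 2  # 16-bit mono PCM


def vad_frame_bytes(sample_rate: int, frame_ms: int) -> int:
    """Return the exact byte length of one VAD frame, or raise on bad settings."""
    if sample_rate not in VALID_SAMPLE_RATES:
        raise ValueError(f"Unsupported sample rate: {sample_rate}")
    if frame_ms not in VALID_FRAME_MS:
        raise ValueError(f"Unsupported frame duration: {frame_ms} ms")
    samples = sample_rate * frame_ms // 1000
    return samples * BYTES_PER_SAMPLE


# The recorder's defaults: 480 samples per frame, 960 bytes
default_frame_bytes = vad_frame_bytes(16000, 30)
```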

Step 2: Speech-to-Text with Whisper

We use OpenAI’s Whisper API for transcription. It’s fast (~1 second for typical utterances), accurate, and handles accents well.

# src/stt.py
from openai import OpenAI
import io


client = OpenAI()


def transcribe(audio_wav: bytes) -> str:
    """Transcribe WAV audio bytes to text using Whisper."""
    if not audio_wav:
        return ""

    audio_file = io.BytesIO(audio_wav)
    audio_file.name = "recording.wav"

    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="text",
        language="en",  # Omit this argument to let Whisper auto-detect the language
    )

    return transcript.strip()
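
Each transcription is a paid network round trip, so it can be worth skipping clips that contain almost no speech, for example when the VAD fires on a cough. The sketch below measures WAV duration with the standard library. The helper names and the 0.3-second minimum are assumptions. The 1.0-second allowance reflects the trailing silence that `record_until_silence` appends by default.

```python
import io
import wave


def wav_duration_seconds(audio_wav: bytes) -> float:
    """Return the duration of in-memory WAV audio in seconds."""
    with wave.open(io.BytesIO(audio_wav), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def is_worth_transcribing(
    audio_wav: bytes,
    min_speech_seconds: float = 0.3,
    trailing_silence_seconds: float = 1.0,
) -> bool:
    """True if the clip is longer than the trailing silence plus a minimum of speech."""
    if not audio_wav:
        return False
    minimum = min_speech_seconds + trailing_silence_seconds
    return wav_duration_seconds(audio_wav) >= minimum
```

In the main loop this check would sit between recording and `transcribe`, skipping the turn when it returns False.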

Local Whisper Alternative

If you want to avoid API costs and don’t mind slower processing on machines without a GPU, run Whisper locally (pip install openai-whisper):

import whisper

model = whisper.load_model("base")  # Options: tiny, base, small, medium, large

def transcribe_local(audio_wav: bytes) -> str:
    """Transcribe using local Whisper model."""
    import os
    import tempfile

    # Write to a temp file so Whisper can load it by path
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        f.write(audio_wav)
        temp_path = f.name

    try:
        result = model.transcribe(temp_path)
    finally:
        os.remove(temp_path)  # delete=False means we clean up ourselves
    return result["text"].strip()

Model	Parameters	Speed (RTF)	Accuracy	VRAM
tiny	39M	0.03	Good	1 GB
base	74M	0.05	Better	1 GB
small	244M	0.12	Great	2 GB
medium	769M	0.25	Excellent	5 GB
large	1.5B	0.50	Best	10 GB

RTF = Real-Time Factor: processing time divided by audio duration, so lower is faster. An RTF of 0.05 means 1 second of audio is processed in 0.05 seconds.
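
Because RTF is a ratio, the expected transcription time scales with the length of the utterance. The arithmetic can be sketched as below, using the RTF values from the table as given; they depend on hardware and are not measurements made here. `estimated_transcribe_seconds` is a hypothetical helper.

```python
# RTF values copied from the table above (hardware-dependent)
WHISPER_RTF = {
    "tiny": 0.03,
    "base": 0.05,
    "small": 0.12,
    "medium": 0.25,
    "large": 0.50,
}


def estimated_transcribe_seconds(audio_seconds: float, model: str = "base") -> float:
    """Estimate processing time as audio duration multiplied by the model's RTF."""
    return audio_seconds * WHISPER_RTF[model]


# A 4-second utterance: about 0.2 s with "base", about 2.0 s with "large"
base_estimate = estimated_transcribe_seconds(4.0, "base")
large_estimate = estimated_transcribe_seconds(4.0, "large")
```

Against a 2-second end-to-end target, the larger models use most or all of the budget on transcription alone.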

Step 3: AI Reasoning with Claude

# src/reasoning.py
import anthropic

client = anthropic.Anthropic()


class ConversationManager:
    """Manages conversation history and Claude interactions."""

    def __init__(self, system_prompt: str = None):
        self.system_prompt = system_prompt or (
            "You are a helpful voice assistant. Keep responses concise — "
            "aim for 1-3 sentences unless the user asks for detail. "
            "Be conversational and natural. Avoid bullet points and "
            "markdown formatting since your responses will be spoken aloud."
        )
        self.history = []

    def get_response(self, user_text: str) -> str:
        """Get a response from Claude given user input."""
        self.history.append({
            "role": "user",
            "content": user_text,
        })

        message = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=300,  # Keep responses short for voice
            system=self.system_prompt,
            messages=self.history,
        )

        response_text = message.content[0].text
        self.history.append({
            "role": "assistant",
            "content": response_text,
        })

        # Keep history manageable (last 10 exchanges)
        if len(self.history) > 20:
            self.history = self.history[-20:]

        return response_text
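
The trimming step works because it runs after the assistant reply is appended, so the kept slice always starts with a user message, which the Messages API expects. If the cap were odd, or trimming ran at a different point, the slice could start with an assistant message. A slightly more defensive version is sketched below. `trim_history` is a hypothetical helper, not part of the Anthropic SDK.

```python
from typing import Dict, List

Message = Dict[str, str]


def trim_history(history: List[Message], max_messages: int = 20) -> List[Message]:
    """Keep the most recent messages, ensuring the first one has role 'user'."""
    trimmed = history[-max_messages:] if max_messages > 0 else []
    # Drop leading assistant messages left behind by the slice
    while trimmed and trimmed[0]["role"] != "user":
        trimmed = trimmed[1:]
    return trimmed
```

In `get_response`, `self.history = trim_history(self.history)` would replace the length check and slice.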

Key Design Decision: max_tokens=300

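Voice responses need to be SHORT. At roughly 0.75 words per token, a full 300-token response is about 225 words, which takes over a minute to speak at a typical 150 words per minute. Treat max_tokens as a hard ceiling rather than a target: the 1-3 sentence guidance in the system prompt is what keeps typical replies to a few seconds. Anything longer and the user loses attention. If the user wants more detail, they’ll ask.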

Step 4: Text-to-Speech with ElevenLabs

# src/tts.py
from elevenlabs import ElevenLabs, play
import io


client = ElevenLabs()


def speak(text: str, voice_id: str = "pNInz6obpgDQGcFmaJgB"):
    """Convert text to speech and play it."""
    audio = client.text_to_speech.convert(
        voice_id=voice_id,  # "Adam" voice, change to your preferred
        model_id="eleven_turbo_v2_5",  # Fastest model
        text=text,
        output_format="mp3_22050_32",
    )

    # Collect the generator output
    audio_bytes = b"".join(audio)
    play(audio_bytes)


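def speak_streaming(text: str, voice_id: str = "pNInz6obpgDQGcFmaJgB"):
    """Stream TTS for lower latency: playback starts before generation finishes."""
    # stream() plays chunks as they arrive and needs mpv installed;
    # play() joins the whole iterator first, which defeats streaming.
    from elevenlabs import stream

    # convert_as_stream is the 1.x SDK method name; check your installed version
    audio_stream = client.text_to_speech.convert_as_stream(
        voice_id=voice_id,
        model_id="eleven_turbo_v2_5",
        text=text,
        output_format="mp3_22050_32",
    )

    stream(audio_stream)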
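
The system prompt in Step 3 asks Claude to avoid markdown, but a prompt is a request, not a guarantee, and a TTS engine may read stray symbols aloud or pause oddly on them. A defensive cleanup pass before calling TTS is sketched below. `clean_for_speech` is a hypothetical helper and the patterns are a deliberately small, lossy set, not a full markdown parser.

```python
import re


def clean_for_speech(text: str) -> str:
    """Strip common markdown markers so the text reads naturally aloud."""
    text = re.sub(r"`{1,3}", "", text)  # inline code and fence ticks
    text = re.sub(r"\*\*|__|\*", "", text)  # bold and italic markers
    text = re.sub(r"^\s*#{1,6}\s+", "", text, flags=re.MULTILINE)  # headings
    text = re.sub(r"^\s*(?:[-•]|\d+\.)\s+", "", text, flags=re.MULTILINE)  # list markers
    text = re.sub(r"\s+", " ", text)  # collapse newlines and runs of spaces
    return text.strip()


# Example: heading, bullets, bold, and inline code are flattened to plain prose
cleaned = clean_for_speech("## Plan\n- **First**, do `x`.\n- Then rest.")
```

Calling `speak_streaming(clean_for_speech(response))` in the main loop would apply it to every reply.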

Voice Selection

ElevenLabs offers dozens of pre-made voices. Pick one that matches your use case:

# List available voices
voices = client.voices.get_all()
for voice in voices.voices:
    print(f"{voice.name}: {voice.voice_id}")

For a professional assistant, “Rachel” (calm, clear) or “Adam” (confident, natural) work well.

Local TTS Alternative: Coqui/XTTS

For offline/free TTS (this example loads Coqui's FastPitch model; install with pip install TTS):

from TTS.api import TTS

tts = TTS(model_name="tts_models/en/ljspeech/fast_pitch")

def speak_local(text: str, output_path: str = "output.wav"):
    tts.tts_to_file(text=text, file_path=output_path)
    # Play with your system's audio player
    import subprocess
    subprocess.run(["afplay", output_path])  # Mac
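
The `afplay` command exists only on macOS. Since ffmpeg is already a prerequisite, one portable option is to fall back to `ffplay`, which ships with it. This is a sketch under that assumption, and `player_command` is a hypothetical helper.

```python
import platform
from typing import List, Optional


def player_command(path: str, system: Optional[str] = None) -> List[str]:
    """Return a command line that plays an audio file on the current OS."""
    system = system or platform.system()
    if system == "Darwin":
        return ["afplay", path]
    # ffplay: no video window, exit when playback finishes
    return ["ffplay", "-nodisp", "-autoexit", path]
```

In `speak_local`, `subprocess.run(player_command(output_path))` would replace the hard-coded `afplay` call.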

Step 5: Putting It All Together

# main.py
from src.audio import AudioRecorder
from src.stt import transcribe
from src.reasoning import ConversationManager
from src.tts import speak_streaming
import time


def main():
    print("=" * 50)
    print("AI Voice Assistant")
    print("Speak naturally. Say 'goodbye' to exit.")
    print("=" * 50)

    recorder = AudioRecorder(vad_aggressiveness=2)
    conversation = ConversationManager()

    # Custom system prompt (optional)
    # conversation = ConversationManager(
    #     system_prompt="You are a cooking assistant..."
    # )

    try:
        while True:
            # 1. Listen
            audio_data = recorder.record_until_silence(silence_threshold_ms=1000)
            if not audio_data:
                continue

            # 2. Transcribe
            t0 = time.time()
            user_text = transcribe(audio_data)
            stt_time = time.time() - t0

            if not user_text:
                continue

            print(f"\nYou: {user_text} ({stt_time:.1f}s)")

            # Check for exit (match whole words so "quite" does not trigger "quit")
            spoken_words = {w.strip(".,!?'\"") for w in user_text.lower().split()}
            if spoken_words & {"goodbye", "exit", "quit"}:
                speak_streaming("Goodbye! Have a great day.")
                break

            # 3. Think
            t0 = time.time()
            response = conversation.get_response(user_text)
            llm_time = time.time() - t0
            print(f"Assistant: {response} ({llm_time:.1f}s)")

            # 4. Speak (this call blocks until playback ends, so tts_time includes playback)
            t0 = time.time()
            speak_streaming(response)
            tts_time = time.time() - t0

            print(f"[STT: {stt_time:.1f}s | LLM: {llm_time:.1f}s | TTS: {tts_time:.1f}s]")

    except KeyboardInterrupt:
        print("\nExiting...")
    finally:
        recorder.cleanup()


if __name__ == "__main__":
    main()

Latency Optimization

The out-of-the-box latency is roughly:

STT (Whisper API): 0.5-1.0s
LLM (Claude Sonnet): 0.8-1.5s
TTS (ElevenLabs streaming): 0.3-0.5s start, then real-time
Total: 1.6-3.0 seconds
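
To see where a given setup falls within these ranges, it helps to time each stage the same way and compare the sum against the 2-second target. The class below is a standard-library sketch. `StageTimer` is a hypothetical name, and the 2.0-second default is the target stated in the Architecture section.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


class StageTimer:
    """Collects per-stage wall-clock durations for one conversation turn."""

    def __init__(self, budget_seconds: float = 2.0):
        self.budget_seconds = budget_seconds
        self.durations: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the stage raised
            self.durations[name] = time.perf_counter() - start

    @property
    def total(self) -> float:
        return sum(self.durations.values())

    def over_budget(self) -> bool:
        return self.total > self.budget_seconds

    def report(self) -> str:
        parts = " | ".join(f"{name}: {secs:.2f}s" for name, secs in self.durations.items())
        return f"[{parts} | total: {self.total:.2f}s]"


# Usage: wrap each stage of the loop
timer = StageTimer()
with timer.stage("STT"):
    pass  # transcribe(audio_data) would go here
```

Note that a TTS call that blocks until playback ends will include playback time; for the latency target, only the time until the first audio starts matters.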

To get under 2 seconds:

1. Stream the LLM Response to TTS

Instead of waiting for the full Claude response before starting TTS, stream the response and start TTS as soon as the first sentence is complete:

import re

import anthropic

from src.tts import speak_streaming  # defined in Step 4

anthropic_client = anthropic.Anthropic()


def stream_response_to_speech(user_text: str, history: list, voice_id: str):
    """Stream Claude's response directly to ElevenLabs for minimal latency."""
    history.append({"role": "user", "content": user_text})

    buffer = ""
    full_response = ""

    with anthropic_client.messages.stream(
        model="claude-sonnet-4-20250514",
        max_tokens=300,
        messages=history,
    ) as stream:
        for text in stream.text_stream:
            buffer += text
            full_response += text

            # Check if we have a complete sentence
            if re.search(r'[.!?]\s*$', buffer):
                # Send this sentence to TTS
                speak_streaming(buffer.strip(), voice_id)
                buffer = ""

    # Speak any remaining text
    if buffer.strip():
        speak_streaming(buffer.strip(), voice_id)

    history.append({"role": "assistant", "content": full_response})

This reduces perceived latency to ~1 second — the user hears the first sentence while Claude is still generating the rest.
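
The end-of-buffer check in the code above fires whenever a streamed chunk happens to end in punctuation, so a chunk boundary after "3." in "3.14" would send half a number to TTS. A variation, sketched below as an assumption rather than the author's method, only treats punctuation as a sentence end once whitespace follows it. `iter_sentences` is a hypothetical helper; abbreviations such as "e.g." will still split.

```python
import re
from typing import Iterable, Iterator

# Shortest run ending in . ! or ? (plus closing quotes/brackets), followed by whitespace
_SENTENCE = re.compile(r"(.+?[.!?][\"')\]]*)\s+", re.DOTALL)


def iter_sentences(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences from a stream of arbitrary text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while True:
            match = _SENTENCE.match(buffer)
            if not match:
                break  # no confirmed sentence end yet; wait for more text
            yield match.group(1).strip()
            buffer = buffer[match.end():]
    # Flush whatever remains when the stream ends
    if buffer.strip():
        yield buffer.strip()


# Example: the decimal point in 3.14 does not end the sentence
sentences = list(iter_sentences(["Pi is 3.", "14 roughly. ", "Is that", " enough? Yes"]))
```

With the Anthropic stream, `for sentence in iter_sentences(stream.text_stream)` would yield each sentence as soon as the following whitespace arrives.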

2. Use Local Whisper for STT

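Local Whisper (base model) has an RTF of about 0.05, so a 4-second utterance transcribes in roughly 0.2 seconds vs. 0.5-1.0 seconds for the API. That assumes hardware that actually reaches the RTF figures in the table above (in practice a GPU); on a CPU-only machine, local transcription can be slower than the API. The accuracy trade-off is minimal for clear speech.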

3. Pre-warm Connections

Keep HTTP connections alive to reduce TLS handshake overhead:

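# Use httpx with connection pooling
import anthropic
import httpx
from openai import OpenAI

http_client = httpx.Client(
    timeout=30.0,
    limits=httpx.Limits(max_keepalive_connections=5),
)

# Pass the pooled client to the SDKs, and create these once at startup
anthropic_client = anthropic.Anthropic(http_client=http_client)
openai_client = OpenAI(http_client=http_client)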

Adding Wake Word Detection (Optional)

If you want “Hey Assistant” style activation instead of always-on listening:

# Using pvporcupine for wake word detection
import numpy as np
import pyaudio
import pvporcupine

porcupine = pvporcupine.create(
    access_key="YOUR_PICOVOICE_KEY",
    keywords=["jarvis"],  # Built-in wake words
)

def listen_for_wake_word(recorder):
    """Listen continuously for wake word, then record."""
    stream = recorder.audio.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=porcupine.sample_rate,
        input=True,
        frames_per_buffer=porcupine.frame_length,
    )

    print("Waiting for wake word...")
    while True:
        frame = stream.read(porcupine.frame_length)
        audio_frame = np.frombuffer(frame, dtype=np.int16)
        keyword_index = porcupine.process(audio_frame)

        if keyword_index >= 0:
            print("Wake word detected!")
            stream.close()
            return True

Cost Analysis

Running the voice assistant for a typical day (50 interactions):

Component	Cost per Query	Daily (50 queries)	Monthly
Whisper API	~$0.006	$0.30	$9
Claude Sonnet	~$0.01	$0.50	$15
ElevenLabs	~$0.02	$1.00	$30
Total	~$0.036	$1.80	$54

For a personal assistant, $54/month is very reasonable. For a product, this is your per-user cost baseline.
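
The table is straightforward multiplication, so it is easy to rerun for a different usage level. The sketch below uses the per-query figures from the table as given; they are the article's estimates, not current vendor pricing. `monthly_cost` is a hypothetical helper.

```python
# Per-query estimates copied from the table above (USD)
COST_PER_QUERY = {
    "Whisper API": 0.006,
    "Claude Sonnet": 0.01,
    "ElevenLabs": 0.02,
}


def monthly_cost(queries_per_day: int, days: int = 30) -> float:
    """Total monthly cost in USD across all three services."""
    per_query = sum(COST_PER_QUERY.values())
    return round(per_query * queries_per_day * days, 2)


# 50 queries a day reproduces the table's monthly total
typical_month = monthly_cost(50)
```

At 200 queries a day the same arithmetic gives $216 a month, which is the kind of number to check before building a product on this stack.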

The Bottom Line

Building a voice assistant is three API calls in a loop: listen, think, speak. The complexity isn’t in the AI — it’s in the audio engineering (VAD, streaming, buffering) and the UX (latency optimization, natural conversation flow).

The code in this tutorial gives you a functional voice assistant in ~200 lines of Python. From here, you can add:

Tool use (let Claude call APIs, search the web, control devices)
Multi-language support (drop language="en" from the transcribe call and Whisper auto-detects the language)
Custom voice cloning (ElevenLabs voice cloning API)
Visual interface (add a web UI with WebSocket audio streaming)

Stop asking chatbots to type. Start talking to them.
