livekit / livekit-wakeword

101 stars · 15 forks · Python · Apache-2.0 · Updated 2d ago

livekit-wakeword

livekit-wakeword is an open-source wake word detection library published by LiveKit. It ships as a Python package (livekit-wakeword on PyPI), a Rust crate (livekit-wakeword on crates.io), and a Swift package targeting iOS 16+ and macOS 14+. The library is built on openWakeWord's audio front-end (a frozen mel-spectrogram ONNX model followed by Google's 96-dimensional speech embedding CNN) but replaces the flat DNN classification head with a Conv-Attention head that uses 1D temporal convolutions and multi-head self-attention across 16 timesteps of speech embeddings. On the "hey livekit" benchmark (15,000 positive clips, 45,084 negative clips, 25 hours of audio), the conv-attention head achieves 0.08 false positives per hour (FPPH) and 86.1% recall, compared with 8.50 FPPH and 68.6% recall for vanilla openWakeWord: roughly a 100× reduction in false triggers.

Showcase

DET curve: openWakeWord (DNN)

DET curve: livekit-wakeword (DNN)

DET curve: livekit-wakeword (conv-attention)

DET curve: 你好 livekit, multilingual (VoxCPM)

Features

Conv-Attention classifier — 1D Conv blocks + MultiheadAttention + MeanPool head that models temporal ordering of phoneme embeddings; 60× lower AUT than openWakeWord.
Backward compatible with openWakeWord models and the openWakeWord library interface.
Multilingual training — 30 languages plus 9 Chinese dialects via VoxCPM2 synthetic TTS; English uses Piper VITS with SLERP speaker blending.
Three classifier architectures — DNN (flat FC), RNN (bi-LSTM), Conv+Attention (default); four size presets (tiny / small / medium / large).
Single-YAML pipeline — one config file drives data generation, augmentation, feature extraction, 3-phase training, ONNX export, and DET-curve evaluation.
Synthetic training data — adversarial negatives via CMU-dict phoneme substitution; no real audio recordings required.
3-phase adaptive training — focal loss + embedding mixup + checkpoint averaging to minimize FPPH while maximizing recall.
INT8 quantization — optional ONNX INT8 quantization via --quantize export flag.
Python inference API — WakeWordModel (stateless predict) + WakeWordListener (async microphone capture with debouncing).
Rust crate — mel and embedding models compiled into binary; only classifier ONNX loaded at runtime; automatic resampling from 22050–384000 Hz.
Swift package — iOS 16+ / macOS 14+; ONNX Runtime with CoreML Execution Provider dispatches to ANE/GPU/CPU; AVAudioConverter handles mic resampling.
Cloud GPU training — SkyPilot integration dispatches training jobs to cloud providers (Nebius example included).
DET curve evaluation — AUT, FPPH, and recall metrics; evaluates any compatible ONNX model including openWakeWord models.
uv-native — zero dependency-conflict setup via uv sync --all-extras.

Live examples

iOS/macOS SwiftUI demo — runnable SwiftUI app for iOS + macOS using WakeWordListener with microphone capture.
Wake Word–Triggered Agent — full LiveKit agent that starts a voice session on wake word detection.

Quick start

```
# Install (inference only); quotes keep shells such as zsh from expanding the brackets
pip install "livekit-wakeword[listener]"

# Install (full training pipeline)
pip install "livekit-wakeword[train,eval,export]"

# Download models and training data
livekit-wakeword setup --config configs/prod.yaml

# Train a custom wake word end-to-end
livekit-wakeword run configs/prod.yaml
```

Documentation


README: livekit-wakeword, Wake Word Library
Source: github.com/livekit/livekit-wakeword/blob/main/README.md

An open-source wake word library for creating voice-enabled applications. Based on openWakeWord with streamlined training: generate synthetic data, augment, train, and export from a single YAML config.

Features:
- Conv-Attention classifier: 1D temporal convolutions + multi-head self-attention replace openWakeWord's flat DNN head, delivering 60x lower AUT and 100x fewer false positives per hour than openWakeWord
- Backward compatible with openWakeWord models and library
- Multilingual support: over 30 languages via VoxCPM synthetic data generation
- Train anywhere: local machine, cloud, or spawn SkyPilot jobs
- Ships as Python package (pip/uv), Rust crate (crates.io), and Swift package (iOS 16+ / macOS 14+)
- Zero dependency headaches: uv handles everything

Benchmarks on "hey livekit" (15,000 positive clips, 45,084 negative clips, 25 hours of audio):

Metric	openWakeWord (DNN)	livekit-wakeword (conv-attention)
AUT	0.0720	0.0012
FPPH	8.50	0.08
Recall	68.6%	86.1%
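
The "60x" and "100x" headline figures can be checked directly against the benchmark numbers above. The short calculation below only divides the published values; the variable names are illustrative.

```python
# Figures copied from the benchmark table above.
baseline = {"aut": 0.0720, "fpph": 8.50, "recall": 0.686}
conv_attention = {"aut": 0.0012, "fpph": 0.08, "recall": 0.861}

aut_ratio = baseline["aut"] / conv_attention["aut"]
fpph_ratio = baseline["fpph"] / conv_attention["fpph"]
recall_gain_points = (conv_attention["recall"] - baseline["recall"]) * 100

print(f"AUT reduction:  {aut_ratio:.0f}x")
print(f"FPPH reduction: {fpph_ratio:.0f}x")
print(f"Recall gain:    {recall_gain_points:.1f} percentage points")
```

The AUT ratio comes out at 60 and the FPPH ratio at about 106, which the README rounds to 100x.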

License: Apache 2.0. Development Status: Beta (v0.2).
Architecture Overview
Source: github.com/livekit/livekit-wakeword/blob/main/docs/overview.md

livekit-wakeword uses a hybrid ONNX + PyTorch architecture. Two frozen ONNX models handle feature extraction (mel spectrogram and speech embeddings), while a lightweight PyTorch classifier head is trained per wake word.

Training Pipeline: Synthetic speech generated via VITS TTS with SLERP speaker blending → augmented with noise/reverb → frozen ONNX feature extractors → lightweight classifier head → exported to ONNX.

Inference Pipeline: Raw 16kHz audio → frozen ONNX feature extractors (mel spectrogram → 96-dim speech embedding) → trained classifier (ONNX) → detection score 0–1.

Why ONNX + PyTorch?
- Fast numpy-based inference without loading PyTorch at detection time
- Shared feature extractors across all wake words
- Minimal training data needed since only the small classifier head is trained

Three classifier architectures are available: DNN (flat), RNN (bi-LSTM), and Conv+Attention (default). The conv-attention head applies 1D convolutions and multi-head self-attention over 16 timesteps of speech embeddings, and is the best of the three at distinguishing wake words from phonetically similar phrases.

Module map: config.py, cli.py, models/ (feature_extractor, classifier, pipeline), data/ (generate, augment, dataset, features), training/ (trainer, metrics), eval/ (evaluate), export/ (onnx), inference/ (model, listener).
Export & Inference API
Source: github.com/livekit/livekit-wakeword/blob/main/docs/export-and-inference.md

The export stage converts the trained PyTorch classifier to ONNX for deployment. The inference API provides WakeWordModel for prediction and WakeWordListener for async microphone detection.

WakeWordModel

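Stateless prediction API. Pass ~2 seconds of 16 kHz audio and receive a confidence score per wake word.
```
import numpy as np
from livekit.wakeword import WakeWordModel

model = WakeWordModel(models=["hey_livekit.onnx"])
# Placeholder input: ~2 s of 16 kHz mono silence (the int16 dtype is an assumption).
audio_chunk = np.zeros(2 * 16000, dtype=np.int16)
scores = model.predict(audio_chunk)  # maps wake word name to confidence
if scores["hey_livekit"] > 0.5:
    print("Wake word detected!")
```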
WakeWordListener

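Async microphone detection with debouncing. Uses PyAudio to capture from the default microphone at 16 kHz, mono, int16.
```
from livekit.wakeword import WakeWordListener  # import path assumed to match WakeWordModel
# Run inside an async function; `model` is a WakeWordModel instance.
async with WakeWordListener(model, threshold=0.5, debounce=2.0) as listener:
    detection = await listener.wait_for_detection()
```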
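
The listener's internals are not described here, so the following is only a sketch of one common way to combine a score threshold with a debounce interval: once a detection fires, further detections are ignored until `debounce` seconds have passed. The `Debouncer` class and its `update` method are hypothetical names, not part of the library.

```python
class Debouncer:
    """Suppress detections that arrive within `debounce` seconds of the last accepted one."""

    def __init__(self, threshold: float = 0.5, debounce: float = 2.0):
        self.threshold = threshold
        self.debounce = debounce
        self._last_fire = None  # timestamp of the last accepted detection

    def update(self, score: float, now: float) -> bool:
        if score <= self.threshold:
            return False
        if self._last_fire is not None and now - self._last_fire < self.debounce:
            return False  # still inside the quiet period
        self._last_fire = now
        return True


d = Debouncer(threshold=0.5, debounce=2.0)
events = [(0.0, 0.9), (0.5, 0.95), (1.0, 0.2), (2.5, 0.8)]  # (time in seconds, score)
fired = [t for t, score in events if d.update(score, t)]
```

With these inputs the high score at 0.5 s is suppressed because it falls inside the 2-second window opened at 0.0 s, so `fired` holds only the detections at 0.0 s and 2.5 s.
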
ONNX Export + INT8 Quantization

export_classifier() exports the trained PyTorch classifier to ONNX opset 18. Optional INT8 dynamic quantization via --quantize flag.

Rust crate
```
[dependencies]
livekit-wakeword = "0.1"
```
Mel spectrogram and speech embedding models compiled into binary; only the classifier ONNX loaded at runtime. Automatic resampling from 22050–384000 Hz to 16 kHz.

Swift package (iOS 16+ / macOS 14+)

WakeWordModel + WakeWordListener APIs match Python interface. ONNX Runtime with CoreML Execution Provider dispatches to ANE/GPU/CPU. A runnable SwiftUI demo lives in examples/ios_wakeword/.
Training Pipeline
Source: github.com/livekit/livekit-wakeword/blob/main/docs/training.md

3-phase adaptive training with focal loss, embedding mixup, AdamW, and checkpoint averaging.

Phase 1 — Full Training: LR warmup → hold → cosine decay. Focal loss + negative weighting + embedding mixup. 50,000 steps default.

Phase 2 — Refinement: 0.1× LR, steps/10 steps. Adaptive negative weight doubling if FPPH > target.

Phase 3 — Fine-Tuning: 0.01× LR, steps/10 steps.
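
The three phases can be sketched as a learning-rate schedule plus a step budget. The warmup and hold fractions, the base learning rate, and all function names below are assumptions for illustration; only the 0.1× / 0.01× scales, the steps/10 budgets, and the "double the negative weight if FPPH exceeds the target" rule come from the description above.

```python
import math


def phase1_lr(step: int, total: int, base_lr: float,
              warmup_frac: float = 0.05, hold_frac: float = 0.30) -> float:
    """Warmup -> hold -> cosine decay. Both fractions are placeholder assumptions."""
    warmup = int(total * warmup_frac)
    hold_end = warmup + int(total * hold_frac)
    if step < warmup:
        return base_lr * (step + 1) / warmup
    if step < hold_end:
        return base_lr
    progress = (step - hold_end) / max(1, total - hold_end)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def phase_plan(steps: int, base_lr: float):
    """(name, step budget, learning rate) for the three phases described above."""
    return [
        ("full", steps, base_lr),
        ("refinement", steps // 10, base_lr * 0.1),
        ("fine-tuning", steps // 10, base_lr * 0.01),
    ]


def adapt_negative_weight(weight: float, fpph: float, target: float) -> float:
    """Phase 2 rule: double the negative weight while FPPH is above the target."""
    return weight * 2.0 if fpph > target else weight


plan = phase_plan(steps=50_000, base_lr=1e-3)  # base_lr is an assumed value
```

With the default of 50,000 steps, phases 2 and 3 each run for 5,000 steps.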

Checkpoint Averaging: Select top checkpoints by FPPH + recall + accuracy, average their weights for a smoother final model.
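
A minimal sketch of checkpoint averaging, using plain lists in place of tensors. How the three metrics are combined into a ranking is not specified above, so the sort key here (lowest FPPH, then highest recall, then highest accuracy) is an assumption, as are the function names.

```python
def select_top(candidates, k=3):
    """Rank checkpoints; the tie-breaking order is an assumed choice."""
    return sorted(candidates, key=lambda c: (c["fpph"], -c["recall"], -c["accuracy"]))[:k]


def average_checkpoints(checkpoints):
    """Element-wise mean of parameter vectors that share the same keys and lengths."""
    n = len(checkpoints)
    return {
        key: [sum(values) / n for values in zip(*(c[key] for c in checkpoints))]
        for key in checkpoints[0]
    }


candidates = [
    {"fpph": 0.30, "recall": 0.90, "accuracy": 0.97, "weights": {"w": [1.0, 2.0]}},
    {"fpph": 0.10, "recall": 0.85, "accuracy": 0.98, "weights": {"w": [3.0, 4.0]}},
    {"fpph": 0.10, "recall": 0.88, "accuracy": 0.98, "weights": {"w": [5.0, 6.0]}},
]
best = select_top(candidates, k=2)
averaged = average_checkpoints([c["weights"] for c in best])
```

Here the two checkpoints with FPPH 0.10 are selected and their weights are averaged element by element.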

Loss Function: Focal loss (γ=2.0) down-weights well-classified examples, so hard examples dominate the gradient without manual hard-example mining. Per-sample negative weighting rises to as much as 1500× by the end of phase 1.
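
To make the down-weighting concrete, here is the standard binary focal loss for a single example, written with the standard library only. Applying the negative weight as a plain multiplier on negative examples is an assumption about how the two mechanisms combine; the function name is illustrative.

```python
import math


def focal_loss(p: float, y: int, gamma: float = 2.0, neg_weight: float = 1.0) -> float:
    """Binary focal loss for one example; p is the predicted wake word probability."""
    p = min(max(p, 1e-7), 1.0 - 1e-7)        # avoid log(0)
    p_t = p if y == 1 else 1.0 - p           # probability assigned to the true class
    weight = 1.0 if y == 1 else neg_weight   # assumed: the weight scales negatives only
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)


easy_negative = focal_loss(0.05, y=0)    # confidently rejected clip
hard_negative = focal_loss(0.90, y=0)    # near-miss scored like a wake word
plain_bce_easy = -math.log(1.0 - 0.05)   # the same easy clip without the focal factor
```

The factor (1 - p_t)^γ shrinks the loss of the easy negative far below its plain cross-entropy value, while the hard negative keeps most of its loss.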

Regularization: Label smoothing (ε=0.05) + embedding mixup (Beta(0.2,0.2) interpolation in embedding space).
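
Both regularizers are simple to write down. The sketch below uses nested lists as stand-ins for (16, 96) embedding arrays. The smoothing convention (1 becomes 1 - ε, 0 becomes ε) and the helper names are assumptions; the Beta(0.2, 0.2) mixing coefficient is taken from the text.

```python
import random


def smooth_label(y: float, eps: float = 0.05) -> float:
    """Binary label smoothing; mapping 1 -> 1 - eps and 0 -> eps is an assumed convention."""
    return y * (1.0 - eps) + (1.0 - y) * eps


def mixup(emb_a, y_a, emb_b, y_b, alpha: float = 0.2, rng=random):
    """Blend two embedding sequences and their labels with lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    mixed = [
        [lam * a + (1.0 - lam) * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(emb_a, emb_b)
    ]
    return mixed, lam * y_a + (1.0 - lam) * y_b, lam


rng = random.Random(0)
positive = [[1.0] * 96 for _ in range(16)]   # stand-in for a wake word embedding
negative = [[0.0] * 96 for _ in range(16)]   # stand-in for a negative embedding
mixed, label, lam = mixup(positive, smooth_label(1.0), negative, smooth_label(0.0), rng=rng)
```

Beta(0.2, 0.2) is U-shaped, so most draws of `lam` land near 0 or 1 and the mixed sample stays close to one of its two sources.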

Three classifiers: DNN (flatten → FC layers), RNN (Bi-LSTM, 2 layers), Conv+Attention (Conv1D blocks → MultiheadAttention → MeanPool → Linear → Sigmoid). Size presets: tiny/small/medium/large.
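
The Conv+Attention chain above maps naturally onto standard PyTorch modules. The following is an illustrative sketch, not the library's classifier: the hidden width, kernel size, number of conv blocks, and head count are placeholder values, and the class name is hypothetical.

```python
import torch
from torch import nn


class ConvAttentionHead(nn.Module):
    """Conv1D blocks -> MultiheadAttention -> MeanPool -> Linear -> Sigmoid (illustrative sizes)."""

    def __init__(self, embed_dim: int = 96, hidden: int = 64, heads: int = 4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(embed_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 16, 96). Conv1d expects (batch, channels, time), hence the transposes.
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)  # (batch, 16, hidden)
        h, _ = self.attn(h, h, h)                         # self-attention across timesteps
        pooled = h.mean(dim=1)                            # mean-pool over the 16 timesteps
        return torch.sigmoid(self.out(pooled)).squeeze(-1)  # (batch,) scores in 0..1


head = ConvAttentionHead()
scores = head(torch.randn(2, 16, 96))
```

Padding of 1 with a kernel of 3 keeps all 16 timesteps through the convolutions, so attention still sees one vector per embedding window.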

Minimal YAML config:
```
model_name: hey_robot
target_phrases: ["hey robot"]
n_samples: 10000
model:
  model_type: conv_attention
  model_size: small
steps: 50000
target_fp_per_hour: 0.2
```
Data Generation Pipeline
Source: github.com/livekit/livekit-wakeword/blob/main/docs/data-generation.md

Synthesizes positive and negative audio clips using a pluggable TTS backend: Piper VITS with SLERP speaker blending (English default) or VoxCPM2 voice design (multilingual).

Multilingual wake words require tts_backend: voxcpm. 30 languages supported: Arabic, Burmese, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Vietnamese — plus 9 Chinese dialects.

Piper VITS + SLERP: Speaker blending via Spherical Linear Interpolation over the speaker embedding space produces diverse voice variety from a single checkpoint.
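
Spherical linear interpolation itself is a short formula. The sketch below implements the textbook version on plain Python lists. Which speaker embeddings are paired and how the interpolation factor `t` is sampled are not described above, so the 2-D vectors and `t = 0.5` are purely illustrative.

```python
import math


def slerp(a, b, t: float):
    """Spherical linear interpolation between vectors a and b for t in [0, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos_omega = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    omega = math.acos(cos_omega)             # angle between the two embeddings
    if omega < 1e-6:                         # nearly parallel: fall back to linear interpolation
        return [(1.0 - t) * x + t * y for x, y in zip(a, b)]
    sin_omega = math.sin(omega)
    w_a = math.sin((1.0 - t) * omega) / sin_omega
    w_b = math.sin(t * omega) / sin_omega
    return [w_a * x + w_b * y for x, y in zip(a, b)]


speaker_a = [1.0, 0.0]   # toy 2-D "speaker embeddings"
speaker_b = [0.0, 1.0]
blend = slerp(speaker_a, speaker_b, 0.5)
blend_norm = math.sqrt(sum(v * v for v in blend))
```

Unlike a straight average, the blended vector keeps unit length when both inputs are unit vectors, which is the usual reason for preferring SLERP in an embedding space.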

Adversarial negative generation: Phoneme substitution via CMU dict generates phonetically similar phrases as hard negatives — training the model to reject near-miss sounds.

Default sample counts: 10,000 positive train / 2,000 positive test / 10,000 negative train / 2,000 negative test / 200 background train / 40 background test.

Cloud GPU training: SkyPilot integration — sky launch skypilot/train.yaml dispatches training to Nebius (or other cloud).
Augmentation Pipeline
Source: github.com/livekit/livekit-wakeword/blob/main/docs/augmentation.md

Applies realistic audio transformations to synthetic TTS clips via AudioAugmentor.

Per-sample augmentations: SevenBandParametricEQ (25% probability), TanhDistortion (25% probability) via audiomentations library.

RIR convolution: 50% probability — convolves with room impulse response via FFT to simulate acoustic environment.

Background mixing: Random SNR 5–15 dB. Background clips also trained as standalone negative class (pure ambient = not a wake word).
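
Mixing at a target SNR reduces to scaling the background so that the ratio of RMS levels matches the requested decibel value. A minimal sketch, assuming equal-length signals and non-silent noise; the function names and the synthetic signals are illustrative.

```python
import math
import random


def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))


def mix_at_snr(speech, noise, snr_db: float):
    """Scale noise so 20*log10(rms(speech) / rms(scaled noise)) equals snr_db, then add."""
    gain = rms(speech) / (rms(noise) * 10 ** (snr_db / 20))
    return [s + gain * n for s, n in zip(speech, noise)], gain


rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * i / 16_000) for i in range(16_000)]  # 1 s tone stand-in
noise = [rng.uniform(-1.0, 1.0) for _ in range(16_000)]
snr_db = rng.uniform(5.0, 15.0)   # range taken from the text above
mixed, gain = mix_at_snr(speech, noise, snr_db)
achieved = 20 * math.log10(rms(speech) / rms([gain * n for n in noise]))
```

`achieved` recovers the sampled `snr_db`, which confirms the gain formula.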

Clip alignment: Positive clips end-aligned with 200ms jitter (simulates real detection scenario). Negative clips center-padded.
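
The two alignment rules can be sketched on plain sample lists. The interpretation of the jitter (the clip ends up to 200 ms before the end of the window) and the zero padding are assumptions, as are the function names.

```python
import random

SAMPLE_RATE = 16_000


def end_align(clip, window_len: int, max_jitter_ms: int = 200, rng=random):
    """Place the clip so it ends within max_jitter_ms of the window end."""
    max_jitter = SAMPLE_RATE * max_jitter_ms // 1000
    assert window_len > max_jitter
    jitter = rng.randint(0, max_jitter)        # trailing silence in samples
    body = clip[-(window_len - jitter):]       # keep the tail if the clip is too long
    left = window_len - jitter - len(body)
    return [0.0] * left + body + [0.0] * jitter


def center_pad(clip, window_len: int):
    """Center the clip in the window, padding both sides with silence."""
    clip = clip[:window_len]
    left = (window_len - len(clip)) // 2
    return [0.0] * left + clip + [0.0] * (window_len - len(clip) - left)


positive = end_align([1.0] * 8_000, 32_000, rng=random.Random(0))
negative = center_pad([1.0] * 8_000, 32_000)
```

End-aligning positives mimics streaming detection, where the model is expected to fire just as the wake word finishes.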

Stacking rounds: Each round applies transforms to the previous round's output, producing progressively degraded audio for robust training.
Feature Extraction Pipeline
Source: github.com/livekit/livekit-wakeword/blob/main/docs/feature-extraction.md

Converts augmented audio clips to fixed-size embedding arrays using two frozen ONNX models.

MelSpectrogramFrontend: 32 mel bands, 10ms hop, 60–3800 Hz range. Output: (batch, time_frames, 32).

SpeechEmbedding: Google speech_embedding CNN (~330k parameters). Sliding window (76 frames, stride 8 = 80ms). Output: (batch, n_windows, 96-dim).

Timestep selection: Last 16 windows from ~2s audio window, left-padded if shorter. Final shape: (batch, 16, 96).
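
The "~2 s" figure follows from the window arithmetic: 16 windows of 76 frames at a stride of 8 frames span 76 + 15 × 8 = 196 mel frames, or 1.96 s at a 10 ms hop. The sketch below reproduces that count and the left-padding step; padding with zeros is an assumption, and the helper names are illustrative.

```python
HOP_MS = 10
WINDOW_FRAMES = 76
STRIDE_FRAMES = 8
N_TIMESTEPS = 16
EMBED_DIM = 96


def n_windows(n_mel_frames: int) -> int:
    """Number of full 76-frame windows at a stride of 8 frames."""
    if n_mel_frames < WINDOW_FRAMES:
        return 0
    return (n_mel_frames - WINDOW_FRAMES) // STRIDE_FRAMES + 1


def frames_needed(windows: int) -> int:
    """Mel frames spanned by the given number of consecutive windows."""
    return WINDOW_FRAMES + (windows - 1) * STRIDE_FRAMES


def last_timesteps(embeddings):
    """Keep the last 16 embeddings, left-padding with zero vectors if there are fewer."""
    tail = embeddings[-N_TIMESTEPS:]
    pad = [[0.0] * EMBED_DIM for _ in range(N_TIMESTEPS - len(tail))]
    return pad + tail


span_ms = frames_needed(N_TIMESTEPS) * HOP_MS
padded = last_timesteps([[1.0] * EMBED_DIM for _ in range(5)])
```

The exact number of mel frames the ONNX front-end emits for a given number of samples is not stated above, so the sketch starts from frame counts rather than raw audio length.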

Shared frozen ONNX feature extractors mean only the small classifier head (tiny/small/medium/large) needs training per wake word — same front-end as openWakeWord (backward compatible).

Memory-mapped .npy batch generator for efficient dataset loading without loading full arrays into RAM.
Evaluation: DET curves, AUT, FPPH
Source: github.com/livekit/livekit-wakeword/blob/main/docs/evaluation.md

The evaluation stage runs the exported ONNX model against held-out validation data and produces a DET (Detection Error Tradeoff) curve, AUT score, and summary metrics.

AUT (Area Under the DET curve): Primary aggregate metric. Lower is better (0 = perfect). livekit-wakeword conv-attention: 0.0012 vs openWakeWord DNN: 0.0720.

FPPH (False Positives Per Hour): How many times the model falsely triggers per hour of non-wake-word audio. livekit-wakeword: 0.08 vs openWakeWord: 8.50.

Recall: True positive rate. livekit-wakeword: 86.1% vs openWakeWord: 68.6%.

Threshold optimization: Scans 0.01–0.99 to maximize recall while keeping FPPH ≤ target_fp_per_hour.
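
The threshold scan can be sketched as a constrained search. This version counts one false positive per negative clip whose score exceeds the threshold, which is an assumption about how FPPH is tallied; the function name and the toy scores are illustrative.

```python
def pick_threshold(pos_scores, neg_scores, neg_hours: float, target_fpph: float):
    """Scan 0.01..0.99 and keep the threshold with the highest recall under the FPPH cap."""
    best = None
    for i in range(1, 100):
        thr = i / 100
        fpph = sum(s > thr for s in neg_scores) / neg_hours
        if fpph > target_fpph:
            continue  # too many false triggers at this threshold
        recall = sum(s > thr for s in pos_scores) / len(pos_scores)
        if best is None or recall > best["recall"]:
            best = {"threshold": thr, "recall": recall, "fpph": fpph}
    return best


pos = [0.9, 0.8, 0.6, 0.3]
neg = [0.7, 0.2, 0.1, 0.05]
strict = pick_threshold(pos, neg, neg_hours=1.0, target_fpph=0.0)
relaxed = pick_threshold(pos, neg, neg_hours=1.0, target_fpph=1.0)
```

Tightening the FPPH target pushes the chosen threshold up and costs recall, which is the trade-off the DET curve visualizes.
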
```
# Evaluate any compatible ONNX model (including openWakeWord models)
uv run livekit-wakeword eval configs/hey_livekit.yaml -m /path/to/model.onnx
```
Outputs: DET curve plot (.png) + metrics JSON. Compatible with any ONNX model sharing the (1,16,96) input contract.

Metric	openWakeWord (DNN)	livekit-wakeword (conv-attention)
AUT	0.0720	0.0012
FPPH	8.50	0.08
Recall	68.6%	86.1%
