| Title: | Text-to-Speech Using the 'Chatterbox' Engine |
|---|---|
| Description: | A native R 'torch' port of the 'Chatterbox' text-to-speech engine <https://github.com/resemble-ai/chatterbox>. Provides speech synthesis with voice cloning; model weights are downloaded from 'HuggingFace' <https://huggingface.co/> via the 'hfhub' package. |
| Authors: | Troy Hernandez [aut, cre] (ORCID: <https://orcid.org/0009-0005-4248-604X>), cornball.ai [cph], Resemble AI [cph] (Chatterbox model architecture and weights (MIT license)) |
| Maintainer: | Troy Hernandez <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.2.1 |
| Built: | 2026-06-25 17:21:19 UTC |
| Source: | https://github.com/cran/chatterbox |
Apply Llama3-style RoPE scaling
apply_llama3_rope_scaling(inv_freq, scaling, dim)
inv_freq |
Inverse frequencies |
scaling |
Scaling configuration |
dim |
Dimension |
Scaled inverse frequencies
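The scaling rule can be illustrated in base R. This is only a sketch, not the package's code: the function name `llama3_scale_sketch` is hypothetical, the field names in `scaling` follow the Hugging Face Llama 3.1 configuration and are an assumption about what the package expects, and the `dim` argument is left out.

```r
# Sketch of Llama3-style RoPE scaling (assumed Hugging Face field names).
llama3_scale_sketch <- function(inv_freq, scaling) {
  factor  <- scaling$factor
  low_f   <- scaling$low_freq_factor
  high_f  <- scaling$high_freq_factor
  old_len <- scaling$original_max_position_embeddings

  wavelen <- 2 * pi / inv_freq
  low_freq_wavelen  <- old_len / low_f
  high_freq_wavelen <- old_len / high_f

  # Long wavelengths (low frequencies) are slowed down by `factor`.
  scaled <- ifelse(wavelen > low_freq_wavelen, inv_freq / factor, inv_freq)

  # Wavelengths between the two thresholds are interpolated smoothly.
  smooth   <- (old_len / wavelen - low_f) / (high_f - low_f)
  smoothed <- (1 - smooth) * inv_freq / factor + smooth * inv_freq
  medium   <- wavelen >= high_freq_wavelen & wavelen <= low_freq_wavelen
  ifelse(medium, smoothed, scaled)
}

dim <- 64
inv_freq <- 1 / (5e5 ^ (seq(0, dim - 2, by = 2) / dim))
scaling <- list(factor = 8, low_freq_factor = 1, high_freq_factor = 4,
                original_max_position_embeddings = 8192)
scaled <- llama3_scale_sketch(inv_freq, scaling)
```

The highest frequencies pass through unchanged, the lowest are divided by `factor`, and the band in between is blended.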
Apply rotary position embeddings
apply_rotary_emb_s3(xq, xk, freqs_cis)
xq |
Query tensor |
xk |
Key tensor |
freqs_cis |
Precomputed frequencies |
List with rotated q and k
Apply rotary position embeddings to Q and K
apply_rotary_pos_emb(q, k, cos, sin, position_ids)
q |
Query tensor (batch, heads, seq, head_dim) |
k |
Key tensor (batch, heads, seq, head_dim) |
cos |
Cosine cache |
sin |
Sine cache |
position_ids |
Position indices |
List with rotated q and k
Attention block for perceiver
attention_block(embed_dim = 1024, num_heads = 4)
embed_dim |
Embedding dimension (default 1024) |
num_heads |
Number of attention heads (default 4) |
nn_module
Basic residual block for FCM
basic_res_block(in_planes, planes, stride = 1)
in_planes |
Input channels |
planes |
Output channels |
stride |
Stride for downsampling |
nn_module
Basic transformer block
basic_transformer_block(dim, num_heads = 8L)
dim |
Hidden dimension |
num_heads |
Number of attention heads |
nn_module
CAM Dense TDNN Block (multiple layers with dense connections)
cam_dense_tdnn_block(num_layers, in_channels, out_channels, bn_channels, kernel_size, dilation = 1)
num_layers |
Number of layers |
in_channels |
Input channels |
out_channels |
Output channels per layer |
bn_channels |
Bottleneck channels |
kernel_size |
Kernel size |
dilation |
Dilation |
nn_module
CAM Dense TDNN Layer
cam_dense_tdnn_layer(in_channels, out_channels, bn_channels, kernel_size, dilation = 1)
in_channels |
Input channels |
out_channels |
Output channels |
bn_channels |
Bottleneck channels |
kernel_size |
Kernel size |
dilation |
Dilation |
nn_module
CAM (Context-Aware Masking) Layer
cam_layer(bn_channels, out_channels, kernel_size, stride = 1, padding = 0, dilation = 1, reduction = 2)
bn_channels |
Bottleneck channels |
out_channels |
Output channels |
kernel_size |
Kernel size |
stride |
Stride |
padding |
Padding |
dilation |
Dilation |
reduction |
Channel reduction factor |
nn_module
CAMPPlus speaker encoder
campplus(feat_dim = 80, embedding_size = 192, growth_rate = 32, init_channels = 128)
feat_dim |
Input feature dimension (default 80) |
embedding_size |
Output embedding size (default 192) |
growth_rate |
Dense block growth rate (default 32) |
init_channels |
Initial TDNN channels (default 128) |
nn_module
Causal Block 1D - CausalConv + LayerNorm + Mish
causal_block1d(in_channels, out_channels, kernel_size = 3L)
in_channels |
Input channels |
out_channels |
Output channels |
kernel_size |
Kernel size |
nn_module
Causal Conditional Flow Matching
causal_cfm(in_channels = 320, out_channels = 80, spk_emb_dim = 80, meanflow = FALSE)
in_channels |
Input channels (x + mu + spks + cond) |
out_channels |
Output channels (mel bins) |
spk_emb_dim |
Speaker embedding dimension |
meanflow |
Logical. Use mean-flow formulation. Default FALSE. |
nn_module
Causal Conv1d - pads left only
causal_conv1d(in_channels, out_channels, kernel_size, stride = 1L, dilation = 1L)
in_channels |
Input channels |
out_channels |
Output channels |
kernel_size |
Kernel size |
stride |
Stride (default 1) |
dilation |
Dilation (default 1) |
nn_module
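To make "pads left only" concrete, here is a single-channel, stride-1 sketch in base R. The helper name `causal_conv1d_sketch` is hypothetical, and the padding amount `(kernel_size - 1) * dilation` is the usual causal-convolution choice, assumed here rather than taken from the package source.

```r
# Left-only padding keeps each output sample from seeing future inputs.
# Convention for this sketch: w[1] weights the current sample, w[2] the
# sample `dilation` steps earlier, and so on.
causal_conv1d_sketch <- function(x, w, dilation = 1L) {
  k <- length(w)
  pad <- (k - 1L) * dilation            # all padding goes on the left
  xp <- c(numeric(pad), x)
  vapply(seq_along(x), function(t) {
    idx <- t + pad - (seq_len(k) - 1L) * dilation  # taps at t, t-d, t-2d, ...
    sum(w * xp[idx])
  }, numeric(1))
}

x <- c(1, 2, 3, 4, 5)
y <- causal_conv1d_sketch(x, w = c(0.5, 0.5))

# Changing the last input leaves every earlier output untouched.
x_future <- replace(x, 5, 100)
y_future <- causal_conv1d_sketch(x_future, w = c(0.5, 0.5))
```

The output has the same length as the input, which is what lets causal blocks be stacked without trimming.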
Causal Masked Diff with Xvector
causal_masked_diff_xvec(vocab_size = 6561, input_size = 512, output_size = 80, spk_embed_dim = 192, input_frame_rate = 25, token_mel_ratio = 2, meanflow = FALSE)
vocab_size |
Speech token vocabulary size |
input_size |
Token embedding size |
output_size |
Mel bins |
spk_embed_dim |
Speaker embedding dimension |
input_frame_rate |
Input frame rate for audio processing |
token_mel_ratio |
Ratio of tokens to mel frames |
meanflow |
Logical. Use mean-flow formulation. Default FALSE. |
nn_module
Causal ResNet Block 1D
causal_resnet_block1d(in_channels, out_channels, time_embed_dim = 1024L)
in_channels |
Input channels |
out_channels |
Output channels |
time_embed_dim |
Time embedding dimension |
nn_module
Self-attention for transformer block
cfm_attention(dim, num_heads = 8L, head_dim = 64L)
dim |
Hidden dimension |
num_heads |
Number of attention heads |
head_dim |
Head dimension (default 64) |
nn_module
UNet-style architecture with:
- 1 down block (320 -> 256, with 4 transformer blocks)
- 12 mid blocks (256 -> 256, each with 4 transformer blocks)
- 1 up block (512 -> 256 with skip connection, 4 transformer blocks)
cfm_estimator(in_channels = 320L, out_channels = 80L, hidden_dim = 256L, num_mid_blocks = 12L, num_transformer_blocks = 4L, meanflow = FALSE)
in_channels |
Input channels (default 320 = x + mu + spks + cond) |
out_channels |
Output channels (default 80 = mel bins) |
hidden_dim |
Hidden dimension (default 256) |
num_mid_blocks |
Number of mid blocks (default 12) |
num_transformer_blocks |
Transformer blocks per layer (default 4) |
meanflow |
Logical. Use mean-flow formulation. Default FALSE. |
nn_module
Constructs the model object and, by default, loads the pretrained
weights in the same call - the Python reference's
from_pretrained/from_local do both at once. Pass
load = FALSE for the bare object (e.g. to inspect it or test
the not-loaded error paths), then load later with
load_chatterbox.
chatterbox(device = "cpu", turbo = FALSE, load = TRUE, tune_gc = TRUE)
device |
Device to use ("cpu", "cuda", "mps", etc.) |
turbo |
Use turbo model (GPT-2 backbone, MeanFlow decoder). Default FALSE. |
load |
Load pretrained weights before returning. Default TRUE.
Requires a prior download ( |
tune_gc |
Tune torch's CUDA GC rates for faster inference (CUDA only, and only when unset). Persistent session side effect; default TRUE. See Details. |
When tune_gc = TRUE (the default) and device is CUDA, this
raises torch's allocator GC floors before the first CUDA op. torch otherwise
runs gc() on nearly every allocation once a model occupies more than
20 percent of VRAM. The call sets
torch.cuda_allocator_reserved_rate (the model footprint over VRAM) and
torch.threshold_call_gc, only when they are unset, so an explicit
setting always wins. This is a deliberate, persistent side effect (torch
reads the rates later, at CUDA init); pass tune_gc = FALSE to skip it.
Chatterbox TTS model object, loaded unless load = FALSE
## Not run:
# Construct and load the standard model on GPU
model <- chatterbox("cuda")
# Bare object without weights (load later with load_chatterbox())
model <- chatterbox("cuda", load = FALSE)
## End(Not run)
torch's allocators invoke a full R garbage collection based on settings that are read ONCE at torch startup. The default CUDA trigger (collections begin once torch reserves 20 percent of the card) sits below chatterbox's ~4.6 GB loaded footprint on most GPUs, which makes autoregressive inference collection-bound: ~91 percent of pure-R generation wall time is GC, and even compiled-loop backends are throttled by it (their allocations flow through the same allocator). With the trigger line above the model floor, pure R runs ~10x faster and the jit backend ~15x.
chatterbox_gc_options(vram_gb = NULL)
vram_gb |
Total GPU memory in GB. Default: detected via nvidia-smi, falling back to 16. |
Only one option matters for speed:
torch.cuda_allocator_reserved_rate. Sweeps over 0.3-0.8 all
give identical speed; the value just chooses how high the VRAM
plateau sits. torch.cuda_allocator_allocated_rate (default
0.8) caps that plateau and can be lowered to ~0.6 on shared GPUs at
no speed cost. The remaining settings
(torch.threshold_call_gc,
torch.cuda_allocator_allocated_reserved_rate) were measured as not
worth touching.
This helper does not (and cannot) change the settings for the current
session: torch reads them at initialization, so they belong in your
.Rprofile or at the very top of a script, before torch loads. Printing
the returned object shows the exact snippet for this machine; the
helper warns when torch is already initialized. Scripts can apply the
values directly with do.call(options, chatterbox_gc_options())
(again: before torch loads).
Rule of thumb for loops: collect once per utterance, not thousands of
times inside it. tts_chunked does this automatically;
in your own batch loops, call gc() after each
generate().
A named list of the recommended options() values,
classed "chatterbox_gc_options" so it prints as the full
tuning advice for this machine.
chatterbox_gc_options(vram_gb = 16)
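A quick arithmetic check of the claim that the default trigger sits below the loaded footprint on most GPUs. It uses only the two figures quoted in this entry (a 20 percent trigger and a footprint of roughly 4.6 GB); the card sizes are illustrative, and the table does not reproduce what chatterbox_gc_options() actually returns.

```r
# Figures quoted above: default trigger at 20 percent of VRAM, and a loaded
# footprint of roughly 4.6 GB.
default_rate <- 0.20
footprint_gb <- 4.6

vram_gb <- c(8, 12, 16, 24, 48)       # illustrative card sizes
trigger_gb <- default_rate * vram_gb

gc_table <- data.frame(
  vram_gb = vram_gb,
  trigger_gb = trigger_gb,
  gc_bound = trigger_gb < footprint_gb  # TRUE: model alone exceeds the trigger
)
print(gc_table)
```

On these numbers, any card under 23 GB starts collecting as soon as the model is loaded, which is the situation the tuned rate is meant to avoid.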
Compute rotary position embeddings frequencies
compute_rope_frequencies(dim, max_seq_len, theta = 5e+05, scaling = NULL, device = "cpu")
dim |
Dimension of embeddings |
max_seq_len |
Maximum sequence length |
theta |
Base frequency |
scaling |
Rope scaling configuration (optional) |
device |
Device to create tensors on |
List with cos and sin caches
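A base-R sketch of how such cos and sin caches are commonly built. The name `rope_cache_sketch` is hypothetical, and repeating the half-width angle table to the full dimension follows the Hugging Face convention, which is an assumption about this port; scaling and device handling are omitted.

```r
# Sketch: rotary position embedding caches as plain matrices.
rope_cache_sketch <- function(dim, max_seq_len, theta = 5e5) {
  inv_freq <- 1 / (theta ^ (seq(0, dim - 2, by = 2) / dim))  # dim/2 frequencies
  angles <- outer(seq_len(max_seq_len) - 1, inv_freq)        # positions start at 0
  angles <- cbind(angles, angles)                            # repeat to full dim
  list(cos = cos(angles), sin = sin(angles))
}

cache <- rope_cache_sketch(dim = 64, max_seq_len = 16)
dim(cache$cos)
```

Each row is one position, so position 0 carries cos = 1 and sin = 0 in every column.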
Uses power spectrum (magnitude^2) without log compression, matching Python's mel_type="amp" and mel_power=2.0.
compute_ve_mel(wav, config = voice_encoder_config())
wav |
Audio samples (numeric vector) |
config |
Voice encoder config |
Mel spectrogram (batch, time, n_mels)
Single conformer block with attention and feed-forward (no convolution).
conformer_encoder_layer(n_feat = 512, n_head = 8, n_ffn = 2048, dropout_rate = 0.1)
n_feat |
Feature dimension |
n_head |
Number of attention heads |
n_ffn |
Feed-forward hidden dimension |
dropout_rate |
Dropout rate |
nn_module
Convolutional RNN F0 Predictor
conv_rnn_f0_predictor(in_channels = 80, cond_channels = 512)
in_channels |
Input channels (mel bins) |
cond_channels |
Hidden channels |
nn_module
Create pre-allocated KV cache
create_kv_cache(batch_size, n_layers, n_heads, head_dim, max_len, device)
batch_size |
Batch size |
n_layers |
Number of transformer layers |
n_heads |
Number of attention heads |
head_dim |
Head dimension |
max_len |
Maximum sequence length |
device |
Device to allocate on |
List with k_cache, v_cache, valid_mask
Create mel filterbank
create_mel_filterbank(sr, n_fft, n_mels, fmin = 0, fmax = NULL, norm = "slaney", htk = FALSE)
sr |
Sample rate |
n_fft |
FFT size |
n_mels |
Number of mel bins |
fmin |
Minimum frequency |
fmax |
Maximum frequency |
norm |
Character. Normalization type. Default "slaney". |
htk |
Logical. Use HTK formula. Default FALSE. |
Mel filterbank matrix (n_mels x (n_fft/2 + 1))
Create voice embedding from reference audio
create_voice_embedding(model, audio, sample_rate = NULL, autocast = NULL, norm_loudness = NULL)
model |
Chatterbox model |
audio |
Reference audio (file path, numeric vector, or torch tensor) |
sample_rate |
Sample rate of audio (if not a file) |
autocast |
Ignored (kept for API compatibility) |
norm_loudness |
Normalize the reference to -27 LUFS before
conditioning ( |
Voice embedding that can be used for synthesis
## Not run:
model <- chatterbox("cuda")
voice <- create_voice_embedding(model, "reference_voice.wav")
res <- generate(model, "Reusing a cached voice.", voice)
## End(Not run)
Dense layer for final embedding
dense_layer(in_channels, out_channels)
in_channels |
Input channels |
out_channels |
Output channels |
nn_module
Download all Chatterbox model files from HuggingFace. In interactive sessions, asks for user consent before downloading.
download_chatterbox_models(force = FALSE)
force |
Re-download even if files exist |
Named list of local file paths (invisibly)
## Not run:
# Download models (~2GB)
download_chatterbox_models()
## End(Not run)
Download all Chatterbox Turbo model files from HuggingFace. The turbo model uses a GPT-2 backbone and MeanFlow decoder for faster inference.
download_chatterbox_turbo_models(force = FALSE)
force |
Re-download even if files exist |
Named list of local file paths (invisibly)
## Not run:
download_chatterbox_turbo_models()
## End(Not run)
Python parity: slice to the span after the first SOS and before the first EOS (s3tokenizer drop_invalid_tokens), then drop any remaining out-of-vocab ids (tts.py filters < SPEECH_VOCAB_SIZE on top). The old elementwise filter kept post-EOS garbage when generation hit the token cap with a mid-stream EOS.
drop_invalid_tokens(tokens)
tokens |
Token tensor or integer vector (0-indexed values) |
Filtered tokens, same type as input
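The difference between the span-based rule and the old elementwise filter shows up in a small base-R sketch. The helper `drop_invalid_sketch` is hypothetical and handles integer vectors only; the vocabulary size 6561 comes from this manual, while the SOS and EOS ids (6561 and 6562) are assumptions made for illustration.

```r
# Sketch: keep the span after the first SOS and before the first EOS,
# then drop any id outside the speech vocabulary.
# The SOS/EOS ids are assumed, not taken from the package.
drop_invalid_sketch <- function(tokens, vocab_size = 6561L,
                                sos = vocab_size, eos = vocab_size + 1L) {
  sos_pos <- which(tokens == sos)
  if (length(sos_pos) > 0) tokens <- tokens[-seq_len(sos_pos[1])]
  eos_pos <- which(tokens == eos)
  if (length(eos_pos) > 0) tokens <- tokens[seq_len(eos_pos[1] - 1L)]
  tokens[tokens < vocab_size]
}

tokens <- c(6561L, 10L, 20L, 7000L, 30L, 6562L, 40L, 50L)
span_filtered <- drop_invalid_sketch(tokens)   # 10 20 30
elementwise   <- tokens[tokens < 6561L]        # 10 20 30 40 50
```

The elementwise result still contains 40 and 50, the tokens generated after the mid-stream EOS, which is the failure described above.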
Creates sinusoidal positional embeddings for use with relative position attention. Includes scaling by sqrt(d_model) and adds positional embedding to the input.
espnet_rel_positional_encoding(d_model = 512, dropout_rate = 0.1, max_len = 5000)
d_model |
Model dimension |
dropout_rate |
Numeric. Dropout rate. Default 0.1. |
max_len |
Maximum sequence length |
nn_module
Factorized Convolutional Module (FCM)
fcm_module(m_channels = 32, feat_dim = 80)
m_channels |
Number of channels |
feat_dim |
Input feature dimension (mel bins) |
nn_module
Feed-forward network for transformer
Matches diffusers FeedForward: net = [GELU(proj), Dropout, Linear]
feed_forward(dim, hidden_dim = NULL)
dim |
Input dimension |
hidden_dim |
Hidden dimension (typically 4x dim) |
nn_module
Multi-head attention with a Feedforward Sequential Memory Network (FSMN)
fsmn_multi_head_attention(n_state, n_head, kernel_size = 31L)
n_state |
Hidden dimension |
n_head |
Number of heads |
kernel_size |
FSMN kernel size (default 31) |
nn_module
FSQ Codebook module
fsq_codebook(dim, level = 3L)
dim |
Input dimension (n_audio_state) |
level |
Quantization level (default 3) |
nn_module
FSQ Vector Quantization wrapper
fsq_vector_quantization(dim, codebook_size = 6561L)
dim |
Input dimension |
codebook_size |
Codebook size (must be 6561 = 3^8) |
nn_module
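The constraint 6561 = 3^8 follows from packing eight three-level digits into one index. A minimal sketch, assuming digits in {0, 1, 2} with the least significant digit first; the helper name `fsq_index_sketch` and the digit order are illustrative assumptions, not the package's internals.

```r
# Sketch: map 8 quantized digits (each 0, 1 or 2) to a codebook index.
fsq_index_sketch <- function(digits, level = 3L) {
  stopifnot(all(digits %in% (seq_len(level) - 1L)))
  sum(digits * level ^ (seq_along(digits) - 1L))
}

lowest  <- fsq_index_sketch(rep(0L, 8))   # 0
highest <- fsq_index_sketch(rep(2L, 8))   # 6560
n_codes <- 3^8                            # 6561
```

Indices therefore run from 0 to 6560, matching the 0-indexed speech token values used elsewhere in this manual.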
GELU activation with projection (matches diffusers GELU structure)
gelu_with_proj(dim_in, dim_out)
dim_in |
Input dimension |
dim_out |
Output dimension |
nn_module
Generate speech from text
generate(model, text, voice, exaggeration = 0.5, cfg_weight = 0.5, temperature = 0.8, top_p = 1, min_p = 0.05, autocast = NULL, traced = FALSE, backend = c("r", "jit"), top_k = 1000L, repetition_penalty = 1.2, normalize_text = FALSE, max_new_tokens = 1000L, max_cache_len = NULL, cfm_len = NULL, skip_vocoder = FALSE, output_path = NULL)
model |
Chatterbox model |
text |
Text to synthesize |
voice |
Voice embedding from create_voice_embedding() or path to reference audio |
exaggeration |
Emotion/expression exaggeration level (0-1, default 0.5) |
cfg_weight |
Classifier-free guidance weight (higher = more adherence to text, default 0.5) |
temperature |
Sampling temperature (default 0.8) |
top_p |
Top-p (nucleus) sampling threshold. Default 1.0 (disabled), matching the Python reference. |
min_p |
Minimum probability threshold relative to the most likely token (default 0.05, matching the Python reference). Standard model only. |
autocast |
Use mixed precision (float16) on CUDA for faster inference. Default FALSE: the Python reference runs float32, and float16 output diverges slightly. Opt in for speed on tight VRAM. |
traced |
Logical. Use JIT-traced inference. Default FALSE. |
backend |
Character. Inference backend, either "r" or "jit".
Default "r". The jit backend runs each token's full 30-layer
forward as one TorchScript call (compiled once per session, in
milliseconds, via |
top_k |
Integer. Top-k sampling parameter (turbo model only). Default 1000. |
repetition_penalty |
Numeric. Repetition penalty. Default 1.2. Applied sign-dependently like HF transformers: positive logits are divided, negative ones multiplied. |
normalize_text |
Logical. Apply the R-specific internal-caps mitigation (lowercase words with internal capitals). Default FALSE: it addressed a "first word then silence" failure that was actually the column-major/STFT bug, now fixed - with the model corrected, leaving caps intact also preserves intended emphasis (ALL CAPS reads as emphasis) and acronyms. Set TRUE only if specific text misbehaves. Punctuation normalization (whitespace collapse, first-letter capitalization, trailing period) always runs, matching the Python reference implementation. |
max_new_tokens |
Maximum speech tokens to generate (default 1000, = 40 s of audio; the model's own ceiling is 4096). |
max_cache_len |
KV cache positions for the jit and traced backends. Default NULL: jit auto-sizes so generation always fits (~1 MB VRAM per position); traced keeps its 350-position trace (a new size triggers a fresh ~50 s trace). Ignored by the pure-R backend, which has no pre-allocated cache. |
cfm_len |
Optional explicit traced-CFM length (the padded mel sequence, = 640 + 2 * tokens). Default NULL: the standard traced path sizes it from the tokens actually generated, rounded up to the 250/500/1000 bucket ladder, so a slow speaker is covered without guessing from text length. Pass a value to pin it (e.g. to pre-trace a bucket). Ignored when not traced or for turbo. |
skip_vocoder |
Logical. If TRUE, stop after flow matching and
return the mel spectrogram instead of audio (Python 0.1.7's
|
output_path |
Optional WAV path. When set, the audio is also
written there (as a side effect) and the returned list gains a
|
List with elements:
audio |
Numeric vector of audio samples (omitted when skip_vocoder = TRUE, which returns mel instead) |
sample_rate |
Sample rate in Hz |
eos_found |
Logical. Whether the model emitted an end-of-speech token (TRUE) or hit the token cap (FALSE). FALSE often indicates garbage output and a need to retry or split the input. |
n_tokens |
Number of speech tokens generated |
audio_sec |
Audio duration in seconds |
Output file path (only when output_path is set)
## Not run:
model <- chatterbox("cuda")
res <- generate(model, "Hello world!", "reference_voice.wav")
write_audio(res$audio, res$sample_rate, "hello.wav")
# Fastest native path: TorchScript decode loop
res <- generate(model, "Hello world!", "reference_voice.wav", backend = "jit")
## End(Not run)
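Two of the sampling controls documented above, repetition_penalty and min_p, are easy to illustrate on a plain logit vector. This is a base-R sketch of the rules as described in the argument list, not the package's sampling code; both helper names are hypothetical, and `seen_ids` uses R's 1-based positions rather than the 0-indexed token values.

```r
# Sign-dependent repetition penalty: positive logits are divided,
# negative ones multiplied, so a repeated token always becomes less likely.
apply_repetition_penalty <- function(logits, seen_ids, penalty = 1.2) {
  hit <- logits[seen_ids]
  logits[seen_ids] <- ifelse(hit > 0, hit / penalty, hit * penalty)
  logits
}

# min_p: drop tokens whose probability is below min_p times the
# probability of the most likely token.
min_p_filter <- function(logits, min_p = 0.05) {
  probs <- exp(logits - max(logits))
  probs <- probs / sum(probs)
  logits[probs < min_p * max(probs)] <- -Inf
  logits
}

logits <- c(2.4, -1.2, 0.6, -3.0)
penalized <- apply_repetition_penalty(logits, seen_ids = c(1L, 2L))
filtered <- min_p_filter(penalized)
```

Here the first two logits are penalized to 2 and -1.44, and the filter then removes the second and fourth tokens, leaving two candidates to sample from.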
Runs T3 token generation per text (autoregressive, sequential), then
synthesizes ALL utterances in a single batched S3Gen pass (one CFM
solve and one vocoder call over the padded batch). Per-utterance
results match single generate calls up to CFM noise
handling - the fixed noise buffer means row i sees the same initial
noise it would alone. Standard model only.
generate_batch(model, texts, voice, ...)
model |
Loaded chatterbox model (standard, not turbo) |
texts |
Character vector of texts to synthesize |
voice |
Shared voice: voice_embedding or reference audio path |
... |
Arguments passed through to the T3 stage, as in
|
List with one generate-style result per text
(audio, sample_rate, eos_found, n_tokens, audio_sec)
## Not run:
model <- chatterbox("cuda")
res <- generate_batch(model, c("First sentence.", "Second sentence."), "reference_voice.wav")
write_audio(res[[1]]$audio, res[[1]]$sample_rate, "first.wav")
## End(Not run)
Get padding for convolution
get_conv_padding(kernel_size, dilation = 1)
kernel_size |
Kernel size |
dilation |
Dilation rate |
Padding size
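For odd kernel sizes, length-preserving padding is usually computed as in the sketch below. The formula is the one used by HiFi-GAN style vocoders and is assumed here rather than read from the package source; `conv_padding_sketch` is a hypothetical name.

```r
# Sketch: padding that keeps the sequence length unchanged for a
# stride-1 convolution with an odd kernel size.
conv_padding_sketch <- function(kernel_size, dilation = 1) {
  (kernel_size * dilation - dilation) %/% 2
}

p_small  <- conv_padding_sketch(3)       # 1
p_medium <- conv_padding_sketch(7, 3)    # 9
p_large  <- conv_padding_sketch(11, 5)   # 25
```

The three calls correspond to kernel and dilation combinations that appear in the hift_generator() defaults.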
Get or create traced layers for cached inference
get_traced_layers(model, max_cache_len = 350L)
model |
T3 model |
max_cache_len |
Maximum cache length |
List of traced layer modules
GPT-2 Attention (combined QKV projection)
gpt2_attention(config)
config |
GPT-2 config |
nn_module
GPT-2 Transformer Block
gpt2_block(config)
config |
GPT-2 config |
nn_module
GPT-2 Model Configuration
gpt2_config()
List with GPT-2 medium config
GPT-2 Layer Normalization
gpt2_layer_norm(hidden_size, eps = 1e-05)
hidden_size |
Dimension |
eps |
Epsilon |
nn_module
GPT-2 MLP (GELU activation)
gpt2_mlp(config)
config |
GPT-2 config |
nn_module
GPT-2 Model (transformer backbone)
gpt2_model(config = NULL)
config |
GPT-2 configuration |
nn_module
HiFiGAN Residual Block
hifigan_resblock(channels = 512, kernel_size = 3, dilations = c(1, 3, 5))
channels |
Number of channels |
kernel_size |
Kernel size |
dilations |
List of dilation rates |
nn_module
Neural Source Filter + ISTFTNet
Reference: https://arxiv.org/abs/2309.09493
hift_generator(in_channels = 80, base_channels = 512, nb_harmonics = 8, sampling_rate = 22050, nsf_alpha = 0.1, nsf_sigma = 0.003, nsf_voiced_threshold = 10, upsample_rates = c(8, 8), upsample_kernel_sizes = c(16, 16), istft_n_fft = 16, istft_hop_len = 4, resblock_kernel_sizes = c(3, 7, 11), resblock_dilation_sizes = list(c(1, 3, 5), c(1, 3, 5), c(1, 3, 5)), source_resblock_kernel_sizes = c(7, 11), source_resblock_dilation_sizes = list(c(1, 3, 5), c(1, 3, 5)), lrelu_slope = 0.1, audio_limit = 0.99)
in_channels |
Input mel channels |
base_channels |
Base channel count |
nb_harmonics |
Number of harmonics for source filter |
sampling_rate |
Output sample rate |
nsf_alpha |
NSF sine amplitude |
nsf_sigma |
NSF noise std |
nsf_voiced_threshold |
F0 voiced threshold |
upsample_rates |
Upsampling rates |
upsample_kernel_sizes |
Upsampling kernel sizes |
istft_n_fft |
ISTFT FFT size |
istft_hop_len |
ISTFT hop length |
resblock_kernel_sizes |
ResBlock kernel sizes |
resblock_dilation_sizes |
ResBlock dilations |
source_resblock_kernel_sizes |
Source resblock kernels |
source_resblock_dilation_sizes |
Source resblock dilations |
lrelu_slope |
LeakyReLU slope |
audio_limit |
Output clipping limit |
nn_module
Initialize cache with first token K/V values
init_cache_from_first(cache, past_key_values)
cache |
Cache list from create_kv_cache |
past_key_values |
List of K/V from first forward pass |
Updated cache (and seq_len as attribute)
Measures the integrated gated loudness of a mono signal in LUFS, matching pyloudnorm's K-weighting meter (the measurement Python chatterbox turbo applies to reference audio).
integrated_loudness(samples, sample_rate)
samples |
Numeric vector of mono audio samples. |
sample_rate |
Sample rate in Hz. |
Loudness in LUFS (-Inf for silence or when no block
passes the gates).
samples <- sin(2 * pi * 440 * seq(0, 1, length.out = 48000))
integrated_loudness(samples, 48000)
Check if model is loaded
is_loaded(model)
model |
Chatterbox model |
TRUE if model is loaded
Learned position embeddings module
learned_position_embeddings(seq_len, model_dim, init_std = 0.02)
seq_len |
Maximum sequence length |
model_dim |
Embedding dimension |
init_std |
Initialization standard deviation |
nn_module
Projects input to model dimension with layer norm and positional encoding.
linear_no_subsampling(input_dim = 512, output_dim = 512, dropout_rate = 0.1)
input_dim |
Input dimension |
output_dim |
Output dimension |
dropout_rate |
Dropout rate |
nn_module
Llama attention module
llama_attention(config, layer_idx)
config |
Model configuration |
layer_idx |
Layer index |
nn_module
Create Llama 520M configuration
llama_config_520m()
List with model configuration
Llama decoder layer
llama_decoder_layer(config, layer_idx)
config |
Model configuration |
layer_idx |
Layer index |
nn_module
Llama MLP module
llama_mlp(config)
config |
Model configuration |
nn_module
Llama model (decoder only)
llama_model(config = NULL)
config |
Model configuration (default: 520M) |
nn_module
RMS Normalization module
llama_rms_norm(hidden_size, eps = 1e-05)
hidden_size |
Dimension to normalize |
eps |
Epsilon for numerical stability |
nn_module
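RMS normalization itself is a one-line computation. The base-R sketch below applies it to a single vector; `rms_norm_sketch` is a hypothetical name, and the learned scale is replaced by a vector of ones for illustration.

```r
# Sketch: divide by the root mean square (no mean subtraction),
# then apply a per-element scale.
rms_norm_sketch <- function(x, weight = rep(1, length(x)), eps = 1e-5) {
  weight * x / sqrt(mean(x^2) + eps)
}

x <- c(3, -4, 12, 0)
y <- rms_norm_sketch(x)
rms_after <- sqrt(mean(y^2))   # very close to 1
```

Unlike layer normalization, the mean is left in place, so the sign pattern and relative sizes of the inputs are preserved.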
Load pretrained weights for all model components.
Requires prior download via download_chatterbox_models.
Idempotent: an already-loaded model is returned unchanged, so
chatterbox(load = TRUE) followed by a stray
load_chatterbox() does not reload.
load_chatterbox(model)
model |
Chatterbox model object |
Chatterbox model with loaded weights
## Not run:
model <- chatterbox("cuda", load = FALSE)
model <- load_chatterbox(model)
## End(Not run)
Loads the turbo variant (GPT-2 backbone, MeanFlow decoder).
Requires prior download via download_chatterbox_turbo_models.
load_chatterbox_turbo(model)
model |
Chatterbox model object (with turbo=TRUE) |
Chatterbox model with loaded weights
## Not run:
model <- chatterbox("cuda", turbo = TRUE, load = FALSE)
model <- load_chatterbox_turbo(model)
## End(Not run)
Load Conformer Encoder weights
load_conformer_encoder_weights(model, state_dict, prefix = "flow.encoder.")
model |
Conformer encoder module |
state_dict |
State dictionary |
prefix |
Key prefix (e.g., "flow.encoder.") |
The model module, with weights copied in from
state_dict.
Load weights from safetensors into Llama model
load_llama_weights(model, state_dict, prefix = "model.")
model |
LlamaModel instance |
state_dict |
Named list of tensors from safetensors |
prefix |
Prefix to strip from weight names (default: "model.") |
Model with loaded weights
Load T3 turbo weights from safetensors
load_t3_turbo_weights(model, state_dict)
model |
T3 turbo model |
state_dict |
Named list of tensors |
Model with loaded weights
Load T3 weights from safetensors
load_t3_weights(model, state_dict)
model |
T3 model |
state_dict |
Named list of tensors |
Model with loaded weights
Load tokenizer from JSON file (internal)
load_tokenizer(vocab_path)
vocab_path |
Path to tokenizer.json |
Tokenizer object (list)
Load a voice embedding from disk
load_voice_embedding(path, device = "cpu")
path |
File written by |
device |
Device to load tensors to (default "cpu"; use the model's device, e.g. "cuda", for generation) |
A voice_embedding object
## Not run:
voice <- load_voice_embedding("narrator.voice", device = "cuda")
model <- chatterbox("cuda")
res <- generate(model, "Loaded a saved voice.", voice)
## End(Not run)
Load voice encoder weights from safetensors
load_voice_encoder_weights(model, state_dict)
model |
Voice encoder model |
state_dict |
Named list of tensors |
Model with loaded weights
Create non-padding mask
make_non_pad_mask_s3(lengths, max_len)
lengths |
Tensor of sequence lengths |
max_len |
Maximum sequence length |
Boolean mask tensor (TRUE for valid positions)
Create padding mask
make_pad_mask(lengths, max_len = NULL)
lengths |
Sequence lengths |
max_len |
Maximum length |
Boolean mask (TRUE for padded positions)
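The two mask conventions documented above (TRUE for valid positions versus TRUE for padded positions) can be illustrated with a small base R sketch. It does not call the package: pad_mask_sketch is a hypothetical helper, and the real functions operate on torch tensors rather than plain matrices.

```r
# Hypothetical base R stand-in for the mask helpers (not the package code).
pad_mask_sketch <- function(lengths, max_len = NULL) {
  if (is.null(max_len)) max_len <- max(lengths)
  positions <- seq_len(max_len)
  # Row i is TRUE wherever the position lies beyond lengths[i].
  outer(lengths, positions, function(len, pos) pos > len)
}

lengths <- c(3, 1, 4)
pad_mask <- pad_mask_sketch(lengths)   # TRUE for padded positions
non_pad_mask <- !pad_mask              # TRUE for valid positions
```

Under this assumption the two masks are logical complements, so either can be derived from the other.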
Convert mask to attention bias
mask_to_bias(mask, dtype)
mask |
Boolean mask |
dtype |
Target dtype |
Attention bias tensor
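A common way to turn a boolean mask into an additive attention bias is to give allowed positions a bias of 0 and masked positions a very large negative value, so that their softmax weight vanishes. The sketch below shows that idea in base R. The fill constant (-1e10), the TRUE-means-valid convention, and the helper names are assumptions for illustration, not confirmed details of the package.

```r
# Hypothetical sketch: TRUE marks positions that may be attended to.
mask_to_bias_sketch <- function(mask, fill = -1e10) {
  (1 - as.numeric(mask)) * fill
}

# Numerically stable softmax for a single score vector.
softmax <- function(x) {
  e <- exp(x - max(x))
  e / sum(e)
}

mask <- c(TRUE, TRUE, FALSE)
scores <- c(0.2, 1.5, 3.0)
bias <- mask_to_bias_sketch(mask)
weights <- softmax(scores + bias)   # the masked position gets zero weight
```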
Mish activation
mish_activation()
nn_module
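For reference, Mish is usually defined as x * tanh(softplus(x)). The following base R function is a minimal sketch of that formula on plain numeric vectors; mish_sketch is a hypothetical name and is independent of the nn_module returned by the package.

```r
# Mish on a numeric vector: x * tanh(log(1 + exp(x))).
mish_sketch <- function(x) {
  x * tanh(log1p(exp(x)))
}

y <- mish_sketch(c(-2, 0, 1, 3))
```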
Check if Models are Downloaded
models_available()
TRUE if all model files exist locally
models_available()
Applies a constant gain so the signal measures target_lufs
integrated loudness. Mirrors Python chatterbox turbo's
norm_loudness(): when the gain is non-finite or non-positive
(e.g. silence), the input is returned unchanged.
normalize_loudness(samples, sample_rate, target_lufs = -27)
samples |
Numeric vector of mono audio samples. |
sample_rate |
Sample rate in Hz. |
target_lufs |
Target integrated loudness (default -27, the Python turbo conditioning default). |
Gain-adjusted samples.
samples <- sin(2 * pi * 440 * seq(0, 1, length.out = 48000))
norm <- normalize_loudness(samples, 48000, target_lufs = -23)
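The constant-gain step and its guard can be sketched in base R as follows. The sketch assumes the integrated loudness has already been measured elsewhere (measuring it needs a loudness meter, which is out of scope here); apply_loudness_gain and measured_lufs are hypothetical names, not part of the package.

```r
# Hypothetical helper: `measured_lufs` would come from a loudness meter.
apply_loudness_gain <- function(samples, measured_lufs, target_lufs = -27) {
  # Convert the dB difference into a linear amplitude gain.
  gain <- 10^((target_lufs - measured_lufs) / 20)
  # Guard: a non-finite or non-positive gain leaves the input untouched.
  if (!is.finite(gain) || gain <= 0) return(samples)
  samples * gain
}

tone <- sin(2 * pi * 440 * seq(0, 1, length.out = 48000))
quieter <- apply_loudness_gain(tone, measured_lufs = -21, target_lufs = -27)
# Silence measures as -Inf, so the gain is infinite and the guard applies.
silence <- apply_loudness_gain(numeric(10), measured_lufs = -Inf)
```

Lowering the loudness by 6 dB scales the samples by about 0.5, while the silent input comes back unchanged.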
The single normalization entry point. Applies, in order: the
R-specific internal-caps mitigation (normalize_internal_caps),
then punctuation normalization (punc_norm: whitespace collapse,
first-letter capitalization, uncommon-punctuation rewrite, trailing
period). punc_norm is the Python-parity piece; the caps step is
R-only and can be turned off.
normalize_tts_text(text, caps = TRUE, punctuation = TRUE)
text |
Character scalar. |
caps |
Apply the internal-caps mitigation. Default TRUE. |
punctuation |
Apply punctuation normalization. Default TRUE. |
Normalized text.
normalize_tts_text("hello world")
Pad audio to multiple of token rate
pad_audio_for_tokenizer(wav, sr)
wav |
Audio samples |
sr |
Sample rate |
Padded audio
Perceiver resampler for conditioning compression
perceiver_resampler(num_query_tokens = 32, embed_dim = 1024, num_heads = 4)
num_query_tokens |
Number of query tokens (default 32) |
embed_dim |
Embedding dimension (default 1024) |
num_heads |
Number of attention heads (default 4) |
nn_module
Two-layer feed-forward network with SiLU activation.
positionwise_feedforward(n_feat = 512, n_ffn = 2048, dropout_rate = 0.1)
n_feat |
Input/output dimension |
n_ffn |
Hidden dimension |
dropout_rate |
Dropout rate |
nn_module
Two causal convolutions with residual connection for look-ahead.
pre_lookahead_layer(channels = 512, pre_lookahead_len = 3)
channels |
Number of channels |
pre_lookahead_len |
Look-ahead length (kernel size - 1 for conv1) |
nn_module
Precompute rotary position embedding frequencies
precompute_freqs_cis(dim, end, theta = 10000)
dim |
Dimension (head_dim) |
end |
Maximum sequence length |
theta |
Base frequency |
Complex frequency tensor
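In the standard Llama-style construction, each pair of head dimensions gets an inverse frequency 1 / theta^(2i / dim), and the complex value for position t is exp(i * t * freq). The base R sketch below follows that recipe. It assumes an even dim and that this port uses the same layout; freqs_cis_sketch is a hypothetical name.

```r
# Hypothetical base R version of the Llama-style frequency table.
freqs_cis_sketch <- function(dim, end, theta = 10000) {
  # One inverse frequency per pair of dimensions (length dim / 2).
  inv_freq <- 1 / theta^(seq(0, dim - 2, by = 2) / dim)
  # Rotation angle for every (position, frequency) pair; positions start at 0.
  angles <- outer(seq_len(end) - 1, inv_freq)
  # Unit-modulus complex numbers, shape (end, dim / 2).
  exp(1i * angles)
}

fc <- freqs_cis_sketch(dim = 8, end = 4)
```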
Print method for chatterbox
## S3 method for class 'chatterbox'
print(x, ...)
x |
Chatterbox model |
... |
Ignored |
x, invisibly. Called for the side effect of printing a
summary of the model to the console.
Print method for chatterbox_gc_options
## S3 method for class 'chatterbox_gc_options'
print(x, ...)
x |
Object from |
... |
Ignored |
x, invisibly
Print method for voice_embedding
## S3 method for class 'voice_embedding'
print(x, ...)
x |
Voice embedding |
... |
Ignored |
x, invisibly. Called for the side effect of printing the
embedding's shape and sample rate to the console.
Normalize punctuation for TTS
punc_norm(text)
text |
Input text |
Normalized text
Loads model if needed and generates speech. Convenient for quick tests.
quick_tts(text, reference_audio, output_path = NULL, device = "cpu", autocast = NULL, turbo = FALSE)
text |
Text to synthesize |
reference_audio |
Path to reference audio file |
output_path |
Optional output file path. If NULL, returns audio data. |
device |
Device to use |
autocast |
Use mixed precision (float16) on CUDA. Default NULL, which resolves to TRUE on CUDA |
turbo |
Logical. Use turbo architecture. Default FALSE. |
The generate result list (audio, sample_rate,
...). When output_path is set the audio is also written there
(the list gains a path element) and the list is returned
invisibly so the audio vector does not print.
## Not run:
quick_tts("Hello!", "reference_voice.wav", "out.wav")
## End(Not run)
Read audio file
read_audio(path)
path |
Path to audio file (WAV or MP3 format) |
List with samples (numeric vector normalized to \[-1, 1\]) and sr (sample rate)
tmp <- file.path(tempdir(), "tone.wav")
write_audio(sin(2 * pi * 440 * seq(0, 1, length.out = 24000)), 24000, tmp)
a <- read_audio(tmp)
str(a) # list(samples = ..., sr = ...)
Reflection padding for 1D (nn_reflection_pad1d equivalent)
reflection_pad1d(padding)
padding |
Integer vector c(left, right) for padding |
nn_module
Multi-head attention with relative positional encodings.
rel_position_attention(n_head = 8, n_feat = 512, dropout_rate = 0.1)
n_head |
Number of attention heads |
n_feat |
Feature dimension |
dropout_rate |
Dropout rate |
nn_module
Resample audio
resample_audio(samples, from_sr, to_sr)
samples |
Numeric vector of audio samples |
from_sr |
Source sample rate |
to_sr |
Target sample rate |
Resampled audio samples
## Not run:
# Windowed-sinc resampling runs on torch, so it needs libtorch installed
tone <- sin(2 * pi * 440 * seq(0, 1, length.out = 24000))
tone_16k <- resample_audio(tone, 24000, 16000)
## End(Not run)
Rotate half of the tensor for RoPE
rotate_half(x)
x |
Input tensor |
Rotated tensor
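In the widely used Hugging Face Llama formulation, this operation splits the last dimension into halves x1 and x2 and returns them as (-x2, x1). The base R sketch below shows that convention on a matrix whose columns play the role of head_dim. Whether this port uses exactly the same layout is an assumption, and rotate_half_sketch is a hypothetical name.

```r
# Hypothetical matrix version: columns stand in for the head dimension.
rotate_half_sketch <- function(x) {
  half <- ncol(x) / 2
  x1 <- x[, seq_len(half), drop = FALSE]          # first half
  x2 <- x[, half + seq_len(half), drop = FALSE]   # second half
  cbind(-x2, x1)
}

x <- matrix(1:8, nrow = 2, byrow = TRUE)   # rows: 1 2 3 4 and 5 6 7 8
rotated <- rotate_half_sketch(x)           # first row becomes -3 -4 1 2
```

Applying the rotation twice negates the input, which is a quick sanity check for any implementation of this convention.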
S3 Audio Encoder V2
s3_audio_encoder(n_mels, n_state, n_head, n_layer, stride = 2L)
n_mels |
Number of mel bins |
n_state |
Hidden dimension |
n_head |
Number of attention heads |
n_layer |
Number of transformer layers |
stride |
Convolution stride (default 2) |
nn_module
Compute log mel spectrogram for S3Tokenizer
s3_log_mel_spectrogram(audio, mel_filters, window, n_fft = 400, device = "cpu")
audio |
Audio tensor (batch, samples) |
mel_filters |
Pre-computed mel filterbank |
window |
Hann window |
n_fft |
FFT size (default 400) |
device |
Device |
Log mel spectrogram (batch, n_mels, time)
Multi-Head Attention base module
s3_multi_head_attention(n_state, n_head)
n_state |
Hidden dimension |
n_head |
Number of heads |
nn_module
Residual attention block
s3_residual_attention_block(n_state, n_head, kernel_size = 31L)
n_state |
Hidden dimension |
n_head |
Number of heads |
kernel_size |
FSMN kernel size |
nn_module
S3Tokenizer V2 module
s3_tokenizer(config = NULL)
config |
Configuration list (default from s3_tokenizer_config()) |
nn_module
S3Tokenizer model configuration
s3_tokenizer_config(n_mels = 128, n_audio_state = 1280, n_audio_head = 20, n_audio_layer = 6, n_codebook_size = 6561)
n_mels |
Number of mel bins (default 128) |
n_audio_state |
Hidden state dimension (default 1280) |
n_audio_head |
Number of attention heads (default 20) |
n_audio_layer |
Number of transformer layers (default 6) |
n_codebook_size |
Codebook size (default 6561 = 3^8) |
Configuration list
S3Gen Token to Waveform
s3gen(meanflow = FALSE)
meanflow |
Logical. Use mean-flow formulation. Default FALSE. |
nn_module
Persists a prepared voice (the R analogue of Python 0.1.7's
Conditionals.save()) so it can be reused across sessions
without the reference audio or recomputation. Tensors are moved to
CPU before saving; the format is torch_save
(not compatible with Python's .pt conditionals).
save_voice_embedding(voice, path)
voice |
Voice embedding from |
path |
Output file path (suggested: a custom extension such as "narrator.voice") |
path, invisibly
## Not run:
model <- chatterbox("cuda")
voice <- create_voice_embedding(model, "reference_voice.wav")
save_voice_embedding(voice, file.path(tempdir(), "narrator.voice"))
## End(Not run)
Starts a blocking HTTP server that loads the chatterbox model once and answers
OpenAI-compatible TTS requests. Intended as a drop-in replacement for the
chatterbox TTS container: point an HTTP client (e.g. tts.api) at
http://<host>:<port> and it serves the same endpoints.
serve(port = 7810L, device = "cuda", voices_dir = NULL, turbo = FALSE, timeout = 300L, max_body = 10L * 1024L^2, warmup = TRUE)
port |
Integer. TCP port to listen on. Default 7810. |
device |
Character. Torch device for the model ("cuda", "cpu", "mps"). |
voices_dir |
Character. Directory of voice reference files. Defaults to
the |
turbo |
Logical. Serve the Chatterbox Turbo model. |
timeout |
Integer. Per-connection I/O timeout in seconds (guards against stalled clients). Default 300. |
max_body |
Integer. Maximum request body size in bytes. Default 10 MB. |
warmup |
Logical. Run one short synthesis at startup to trigger the one-time JIT tracing, so the first client request isn't slow. Default TRUE. |
Endpoints:
GET /health - liveness probe, returns {"status":"ok"}.
GET /v1/audio/voices - lists voice names in voices_dir.
POST /v1/audio/speech - body {input, voice,
response_format, exaggeration, cfg_weight, temperature}; returns the
synthesized audio bytes. voice is a voice-library name (resolved
against voices_dir) or a path to a reference audio file.
The server is single-threaded and runs until interrupted. Run it under a
process supervisor (systemd, a container CMD, tmux) for persistence. An
example systemd unit ships with the package:
system.file("chatterbox.service", package = "chatterbox").
Does not return normally; runs until interrupted.
## Not run:
# OpenAI-compatible TTS server on port 7810
serve(port = 7810L, device = "cuda")
## End(Not run)
Generates sine waveforms from F0 for source-filter synthesis
sine_gen(sample_rate, harmonic_num = 0, sine_amp = 0.1, noise_std = 0.003, voiced_threshold = 0)
sample_rate |
Sampling rate in Hz |
harmonic_num |
Number of harmonics |
sine_amp |
Sine amplitude |
noise_std |
Noise standard deviation |
voiced_threshold |
F0 threshold for voiced/unvoiced |
nn_module
Sinusoidal positional embedding for timesteps
sinusoidal_pos_emb(dim = 320L)
dim |
Output dimension |
nn_module
Sine-based periodic activation: x + (1/a) * sin^2(a * x).
Reference: https://arxiv.org/abs/2006.08195
snake_activation(in_features, alpha_trainable = TRUE, alpha_logscale = FALSE)
in_features |
Number of input channels |
alpha_trainable |
Whether alpha is trainable |
alpha_logscale |
Whether to use log scale for alpha |
nn_module
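The Snake formula can be evaluated directly on a numeric vector. The sketch below uses a single scalar alpha for clarity; the module presumably keeps one alpha per input channel (an assumption based on the in_features argument), and snake_sketch is a hypothetical name.

```r
# Snake activation with a scalar alpha: x + (1 / alpha) * sin(alpha * x)^2.
snake_sketch <- function(x, alpha = 1) {
  x + (1 / alpha) * sin(alpha * x)^2
}

x <- seq(-2, 2, by = 0.5)
y <- snake_sketch(x, alpha = 2)
```

Larger alpha values raise the frequency of the periodic term while shrinking its amplitude.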
Source Module for Neural Source Filter
source_module_hn_nsf(sample_rate, upsample_scale, harmonic_num = 0, sine_amp = 0.1, add_noise_std = 0.003, voiced_threshold = 0)
sample_rate |
Sampling rate |
upsample_scale |
Upsampling factor |
harmonic_num |
Number of harmonics |
sine_amp |
Sine amplitude |
add_noise_std |
Noise std |
voiced_threshold |
Voiced threshold |
nn_module
Statistics pooling
statistics_pooling(x)
x |
Input tensor (batch, channels, time) |
Statistics tensor (batch, channels * 2)
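Statistics pooling conventionally concatenates the per-channel mean and standard deviation over the time axis, which is how (batch, channels, time) becomes (batch, channels * 2). The base R sketch below shows that shape change on a plain array. It uses the sample standard deviation from sd(); whether the package uses the biased or unbiased estimator is not stated in the source, so treat that choice as an assumption.

```r
# Hypothetical array version: x has dims (batch, channels, time).
statistics_pooling_sketch <- function(x) {
  mu <- apply(x, c(1, 2), mean)    # (batch, channels)
  sigma <- apply(x, c(1, 2), sd)   # sample standard deviation (n - 1)
  cbind(mu, sigma)                 # (batch, channels * 2)
}

set.seed(1)
x <- array(rnorm(2 * 3 * 50), dim = c(2, 3, 50))
pooled <- statistics_pooling_sketch(x)
```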
Create T3 conditioning object
t3_cond(speaker_emb, cond_prompt_speech_tokens = NULL, cond_prompt_speech_emb = NULL, emotion_adv = 0.5)
speaker_emb |
Speaker embedding tensor (B, 256) |
cond_prompt_speech_tokens |
Optional speech tokens for conditioning |
cond_prompt_speech_emb |
Optional pre-computed speech embeddings |
emotion_adv |
Emotion/exaggeration control (0-1) |
List representing T3Cond
T3 conditioning encoder
t3_cond_enc(config = NULL)
config |
T3 configuration |
nn_module
Move T3 conditioning to device
t3_cond_to_device(cond, device)
cond |
T3 conditioning object |
device |
Target device |
T3 conditioning on device
Create T3 configuration (English-only)
t3_config_english()
List with T3 configuration
Create T3 turbo configuration (GPT-2 backbone)
t3_config_turbo()
List with T3 turbo configuration
Uses jit_trace for ~8x faster per-token inference.
t3_inference_traced(model, cond, text_tokens, max_new_tokens = 1000, temperature = 0.8, cfg_weight = 0.5, top_p = 1, min_p = 0.05, repetition_penalty = 1.2, max_cache_len = 350L)
model |
T3 model |
cond |
Conditioning object |
text_tokens |
Text token tensor |
max_new_tokens |
Maximum tokens to generate |
temperature |
Sampling temperature |
cfg_weight |
Classifier-free guidance weight |
top_p |
Top-p sampling threshold |
min_p |
Min-p filtering threshold |
repetition_penalty |
Repetition penalty |
max_cache_len |
Maximum KV cache length |
Generated speech token tensor
T3 Token-to-Token TTS model
t3_model(config = NULL)
config |
T3 configuration |
nn_module
T3 Token-to-Token TTS model (Turbo variant with GPT-2 backbone)
t3_model_turbo(config = NULL)
config |
T3 turbo configuration |
nn_module
TDNN Layer
tdnn_layer(in_channels, out_channels, kernel_size, stride = 1, dilation = 1, padding = NULL)
in_channels |
Input channels |
out_channels |
Output channels |
kernel_size |
Kernel size |
stride |
Stride |
dilation |
Dilation |
padding |
Padding (default: computed from kernel_size and dilation) |
nn_module
Timestep embedding MLP
timestep_embedding(in_channels = 320L, time_embed_dim = 1024L)
in_channels |
Input channels |
time_embed_dim |
Output dimension |
nn_module
Mirrors the HF tokenizers pipeline used by the Python reference: added tokens ([SPACE], [laughter], [sigh], ...) are extracted first and map directly to their ids; BPE merges run on the remaining text, in priority order (first merge = highest priority).
tokenize_text(tokenizer, text)
tokenizer |
Tokenizer object |
text |
Input text |
Integer vector of token IDs
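The two-stage pipeline described above (added tokens first, then BPE merges on the remaining text) can be sketched with a toy vocabulary in base R. Everything in this example is made up for illustration: the ids, the single merge rule, and the helper names bpe_merge and tokenize_sketch. It also simplifies BPE by applying each merge rule in turn, in priority order, and it is not the package's tokenizer.

```r
# Toy tables; all ids and merges here are hypothetical.
added_tokens <- c("[SPACE]" = 1L, "[laughter]" = 2L)
vocab <- c(h = 10L, i = 11L, hi = 12L, "!" = 13L)
merges <- list(c("h", "i"))   # first entry = highest priority

# Apply each merge rule, in priority order, to a vector of symbols.
bpe_merge <- function(symbols, merges) {
  for (m in merges) {
    i <- 1
    while (i < length(symbols)) {
      if (symbols[i] == m[1] && symbols[i + 1] == m[2]) {
        merged <- paste0(m[1], m[2])
        symbols <- append(symbols[-c(i, i + 1)], merged, after = i - 1)
      } else {
        i <- i + 1
      }
    }
  }
  symbols
}

tokenize_sketch <- function(text) {
  # Added tokens are matched first and map straight to their ids.
  hits <- gregexpr("\\[SPACE\\]|\\[laughter\\]", text)
  specials <- regmatches(text, hits)[[1]]
  plain <- regmatches(text, hits, invert = TRUE)[[1]]
  ids <- integer(0)
  for (k in seq_along(plain)) {
    if (nzchar(plain[k])) {
      # Remaining text is split into characters and merged with BPE.
      symbols <- bpe_merge(strsplit(plain[k], "")[[1]], merges)
      ids <- c(ids, unname(vocab[symbols]))   # unknown symbols would give NA
    }
    if (k <= length(specials)) {
      ids <- c(ids, unname(added_tokens[specials[k]]))
    }
  }
  ids
}

ids <- tokenize_sketch("hi[SPACE]hi!")
```

With these toy tables the added token is emitted as a single id and is never split into characters, which is the point of extracting it before BPE runs.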
This module is designed to be traced with jit_trace. It:
- uses a pre-allocated KV cache of fixed maximum size
- uses an attention mask to indicate valid cache positions
- returns only the output tensor (no lists/dicts)
traceable_attention(attn, max_cache_len = 300L)
attn |
Original llama_attention module |
max_cache_len |
Maximum cache length |
nn_module
Traceable decoder layer with pre-allocated KV cache
traceable_decoder_layer(layer, max_cache_len = 300L)
layer |
Original llama_decoder_layer |
max_cache_len |
Maximum cache length |
nn_module
Computes K and V projections with RoPE for a single layer. Returns concatenated K and V for easy unpacking.
traceable_kv_projector(layer)
layer |
Original llama_decoder_layer |
nn_module
This wraps the full Llama model for traced cached inference. Uses pre-allocated KV cache for all layers.
traceable_transformer_cached(tfmr, max_cache_len = 300L)
tfmr |
Original llama_model |
max_cache_len |
Maximum cache length |
nn_module
Traceable transformer for first token (no cache)
traceable_transformer_first(tfmr)
tfmr |
Original llama_model |
nn_module
Transit layer (channel reduction)
transit_layer(in_channels, out_channels, bias = FALSE)
in_channels |
Input channels |
out_channels |
Output channels |
bias |
Whether to use bias (default FALSE) |
nn_module
Transpose layer for use in sequential
nn_module
Splits at sentence boundaries (oversized sentences subdivided at commas,
then word-split as a last resort), resolves the voice once, and runs T3
on every chunk first so batching uses ACTUAL speech-token lengths rather
than a character estimate. Chunks are then bucketed by their real length
and synthesized within a per-card batch cap (sized from VRAM): a group
of one takes the fast traced-CFM path, a group of several runs as one
eager batched S3Gen solve. Audio is stitched in original order; garbage
is collected at each batch boundary (see
chatterbox_gc_options). Turbo has no batched path and is
synthesized serially.
tts_chunked(model, text, voice, chunk_size = 200, max_batch = NULL, ...)
model |
Chatterbox model |
text |
Text to synthesize |
voice |
Voice embedding or path to reference audio (resolved once) |
chunk_size |
Maximum characters per chunk (default 200) |
max_batch |
Maximum chunks per batched solve. Default NULL: sized per card from VRAM. Set an integer to override. |
... |
Synthesis arguments forwarded to the T3 and S3Gen stages, as
in |
List with audio and sample_rate
## Not run:
model <- chatterbox("cuda")
res <- tts_chunked(model, long_text, "reference_voice.wav")
write_audio(res$audio, res$sample_rate, "long.wav")
## End(Not run)
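The text-splitting stage described above can be approximated in base R as follows. This is a sketch under stated assumptions, not the package's splitter: the regular expressions, the greedy packing of adjacent pieces up to chunk_size, and all helper names (split_sentences, subdivide, pack_chunks, chunk_text_sketch) are hypothetical.

```r
# Split after ., ! or ? followed by whitespace (a simplification).
split_sentences <- function(text) {
  parts <- strsplit(text, "(?<=[.!?])\\s+", perl = TRUE)[[1]]
  parts[nzchar(parts)]
}

# Oversized pieces fall back to comma boundaries, then to single words.
subdivide <- function(piece, chunk_size) {
  if (nchar(piece) <= chunk_size) return(piece)
  by_comma <- strsplit(piece, "(?<=,)\\s+", perl = TRUE)[[1]]
  if (length(by_comma) > 1) {
    return(unlist(lapply(by_comma, subdivide, chunk_size = chunk_size)))
  }
  strsplit(piece, "\\s+")[[1]]
}

# Greedily join adjacent pieces while the chunk stays within chunk_size.
pack_chunks <- function(pieces, chunk_size) {
  chunks <- character(0)
  current <- ""
  for (p in pieces) {
    candidate <- if (nzchar(current)) paste(current, p) else p
    if (nchar(candidate) <= chunk_size || !nzchar(current)) {
      current <- candidate
    } else {
      chunks <- c(chunks, current)
      current <- p
    }
  }
  if (nzchar(current)) chunks <- c(chunks, current)
  chunks
}

chunk_text_sketch <- function(text, chunk_size = 200) {
  pieces <- unlist(lapply(split_sentences(text), subdivide,
                          chunk_size = chunk_size))
  pack_chunks(pieces, chunk_size)
}

text <- paste(
  "First sentence here.",
  "A much longer second sentence, with a comma in the middle, keeps going.",
  "Short one."
)
chunks <- chunk_text_sketch(text, chunk_size = 40)
```

Because pieces are only ever cut at whitespace, joining the chunks with single spaces reproduces this input, which mirrors the requirement that audio is stitched back in the original order.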
Thin convenience wrapper over generate with
output_path set, kept for the file-summary return shape. New
code can call generate(..., output_path = path) directly.
tts_to_file(model, text, voice, output_path, ...)
model |
Chatterbox model |
text |
Text to synthesize |
voice |
Voice embedding or path to reference audio |
output_path |
Output file path (WAV format) |
... |
Additional arguments passed to generate() |
Invisibly returns a list with elements: path,
eos_found, n_tokens, audio_sec. When iterating
over many texts, collect these into a data.frame to identify which
inputs failed (eos_found = FALSE) and need reprocessing.
## Not run:
model <- chatterbox("cuda")
tts_to_file(model, "Hello world!", "reference_voice.wav", "out.wav")
## End(Not run)
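The advice about collecting the returned summaries into a data.frame might look like the sketch below. The three result lists are mocked in the documented shape (path, eos_found, n_tokens, audio_sec) so the example runs without a model; in real use each element would be the value returned by tts_to_file().

```r
# Mocked return values; real ones would come from tts_to_file().
results <- list(
  list(path = "out_1.wav", eos_found = TRUE,  n_tokens = 120L,  audio_sec = 4.8),
  list(path = "out_2.wav", eos_found = FALSE, n_tokens = 1000L, audio_sec = 40.0),
  list(path = "out_3.wav", eos_found = TRUE,  n_tokens = 75L,   audio_sec = 3.0)
)

# One row per synthesized text.
summary_df <- do.call(rbind, lapply(results, as.data.frame))

# Inputs that never produced an end-of-speech token need reprocessing.
failed <- as.character(summary_df$path[!summary_df$eos_found])
```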
Check if Turbo Models are Downloaded
turbo_models_available()
TRUE if all turbo model files exist locally
turbo_models_available()
Update KV cache with new K/V values
update_kv_cache(cache, layer_idx, new_k, new_v, position)
cache |
Cache list from create_kv_cache |
layer_idx |
Layer index (1-indexed) |
new_k |
New key tensor (batch, heads, 1, head_dim) |
new_v |
New value tensor (batch, heads, 1, head_dim) |
position |
Current position (0-indexed) |
The cache list, invisibly, with the new K/V written in
place at position.
Update valid mask to include new position
update_valid_mask(cache, position)
cache |
Cache list from create_kv_cache |
position |
Current position (0-indexed) |
The cache list, invisibly, with position marked
valid in the attention mask.
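The indexing convention shared by the two cache helpers (1-indexed layers, 0-indexed positions) is easy to get wrong, so here is a base R sketch of a pre-allocated cache. It uses plain arrays instead of torch tensors, folds the valid-mask update into the same helper, and returns the modified list because base R copies on modification, whereas the package writes in place. The cache layout and the helper names are assumptions.

```r
# Hypothetical cache: one (batch, heads, max_len, head_dim) array per layer.
create_cache_sketch <- function(n_layers, batch, heads, max_len, head_dim) {
  blank <- function() array(0, dim = c(batch, heads, max_len, head_dim))
  list(
    k = lapply(seq_len(n_layers), function(i) blank()),
    v = lapply(seq_len(n_layers), function(i) blank()),
    valid = rep(FALSE, max_len)
  )
}

update_cache_sketch <- function(cache, layer_idx, new_k, new_v, position) {
  # position is 0-indexed, while R arrays are 1-indexed.
  cache$k[[layer_idx]][, , position + 1, ] <- new_k
  cache$v[[layer_idx]][, , position + 1, ] <- new_v
  cache$valid[position + 1] <- TRUE
  cache
}

cache <- create_cache_sketch(n_layers = 2, batch = 1, heads = 2,
                             max_len = 4, head_dim = 3)
new_k <- array(1, dim = c(1, 2, 1, 3))
new_v <- array(2, dim = c(1, 2, 1, 3))
cache <- update_cache_sketch(cache, layer_idx = 1, new_k, new_v, position = 0)
```

Only the first slot of layer 1 is filled and marked valid; every other slot keeps its pre-allocated zeros, so the array shapes never change between steps.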
2x upsampling using interpolation + convolution.
upsample_1d(channels = 512, stride = 2L)
channels |
Number of channels |
stride |
Upsample factor |
nn_module
Upsample Conformer Encoder
upsample_conformer_encoder(input_size = 512, output_size = 512, num_blocks = 6)
input_size |
Input dimension |
output_size |
Output dimension |
num_blocks |
Number of conformer blocks |
nn_module
Full conformer encoder matching Python UpsampleConformerEncoder.
upsample_conformer_encoder_full(input_size = 512, output_size = 512, num_blocks = 6, num_up_blocks = 4, n_head = 8, n_ffn = 2048, dropout_rate = 0.1, pre_lookahead_len = 3)
input_size |
Input dimension |
output_size |
Output dimension |
num_blocks |
Number of conformer blocks before upsample |
num_up_blocks |
Number of conformer blocks after upsample |
n_head |
Number of attention heads |
n_ffn |
Feed-forward hidden dimension |
dropout_rate |
Dropout rate |
pre_lookahead_len |
Look-ahead length |
nn_module
Re-synthesizes audio so the same words and prosody come out in
the target voice (Python chatterbox's ChatterboxVC). No text
or T3 generation is involved: the source speech is tokenized directly
(25 tokens/s) and S3Gen renders the tokens with the target speaker's
conditioning, so the result follows the source's timing.
voice_convert(model, audio, voice, sample_rate = NULL)
model |
Loaded chatterbox model (standard, not turbo) |
audio |
Source speech (file path, numeric vector, or torch tensor) |
voice |
Target voice: a voice_embedding from
|
sample_rate |
Sample rate of |
List with audio (numeric vector), sample_rate
(24000), and audio_sec, like generate
## Not run:
model <- chatterbox("cuda")
res <- voice_convert(model, "source_speech.wav", "target_voice.wav")
write_audio(res$audio, res$sample_rate, "converted.wav")
## End(Not run)
Voice encoder module
voice_encoder(config = NULL)
config |
Voice encoder configuration |
nn_module
Voice encoder configuration
voice_encoder_config()
List with configuration parameters
Write audio file
write_audio(samples, sr, path)
samples |
Numeric vector of audio samples (normalized to \[-1, 1\]) |
sr |
Sample rate |
path |
Output path (WAV format) |
The output path, invisibly. Called for the side effect
of writing a WAV file.
tmp <- file.path(tempdir(), "tone.wav")
write_audio(sin(2 * pi * 440 * seq(0, 1, length.out = 24000)), 24000, tmp)