serve() resolves a voice-library name (e.g. "Barry") against
voices_dir before treating the voice field as a path. Previously a
like-named file or directory in the server's working directory shadowed
the library voice, so /v1/audio/speech returned a 500 ("cannot open
the connection"). A path is now accepted only when it is a regular file.

First CRAN release. Gathers the 0.1.0.1 - 0.1.0.16 development series:
a complete pure-R port of Chatterbox TTS (no Python, no compiled code),
voice cloning, long-form chunked synthesis, an OpenAI-compatible
serve(), a TorchScript (jit) decode backend at container speed, and
automatic CUDA GC tuning. Per-change detail for the series is below.
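
The serve() voice lookup order described in the first entry (library name first, then a path that must be a regular file) can be sketched roughly as follows. resolve_voice() is a hypothetical helper, not the package's code, and the layout it assumes (library voices stored as files named <name>.<ext> directly in voices_dir) is an assumption made for the example.

```r
# Hypothetical helper illustrating the lookup order; not chatterbox's implementation.
# Assumes library voices are files named <name>.<ext> directly under voices_dir.
resolve_voice <- function(voice, voices_dir) {
  # 1. Library lookup first, so a like-named file or directory in the
  #    working directory cannot shadow a library voice.
  candidates <- list.files(voices_dir, full.names = TRUE)
  stems <- tools::file_path_sans_ext(basename(candidates))
  hit <- candidates[stems == voice]
  if (length(hit) > 0) {
    return(hit[[1]])
  }
  # 2. Path fallback: accept only an existing non-directory file.
  if (utils::file_test("-f", voice)) {
    return(normalizePath(voice))
  }
  stop("unknown voice: ", voice, call. = FALSE)
}
```

With this order, a directory that happens to share a voice's name is never opened as audio; it either loses to the library entry or is rejected.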
chatterbox() gains a tune_gc argument (default TRUE) to opt out of
the CUDA GC tuning added in 0.1.0.15. The tuning is a deliberate,
persistent options() side effect (torch reads the allocator rates
later, at CUDA init), documented in ?chatterbox; pass
tune_gc = FALSE to skip it. No behavior change at the default.

chatterbox() now tunes torch's CUDA garbage-collection rates before the
first CUDA op. torch reads torch.cuda_allocator_reserved_rate (and
torch.threshold_call_gc) once at lazy CUDA init; the 0.2 default floor
meant gc ran on nearly every allocation once a model occupied more than
20% of VRAM, which was 53% of inference wall time. The floor is now the
model's footprint as a fraction of VRAM (4.1GB regular, 3.6GB turbo): e.g. a
16GB card gets 0.26 / 0.23, a 6GB card 0.68 / 0.60. threshold_call_gc is
raised to 16000 MB. All set ahead of cuda_is_available(). Turbo is ~2x
faster on a 16GB card (10.7s -> 5.3s for a 16s utterance). An explicit
user-set option still wins. See torch's memory-management vignette.

read_audio() now detects the audio container from the file's magic
bytes (RIFF/WAVE, ID3, MP3 frame sync) instead of trusting the
extension. A reference saved as PCM/WAV but named .mp3 (or vice
versa) previously ran the wrong decoder and produced NaN garbage,
silently corrupting voice cloning; it now decodes correctly.

serve() now caches each voice embedding (by reference path + mtime)
and reuses it across requests, instead of re-encoding the reference on
every /v1/audio/speech call. Per-request re-encoding churned voice
GPU tensors and raced the CUDA caching allocator, intermittently
producing NaN speaker conditioning - seen as a "missing value where
TRUE/FALSE needed" 500 and as degraded voice cloning (~33-50% of
requests on both an RTX 5060 Ti and a GTX 1660 Ti; 0 with the cache).
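
A minimal sketch of the path + mtime caching idea from the serve() entry above, in base R. The names voice_cache and cached_embedding(), and the encode argument standing in for the real reference encoder, are hypothetical; the package's actual cache may be structured differently.

```r
# Hypothetical sketch: an in-memory cache keyed by reference path + mtime.
# `encode` stands in for whatever function turns a reference file into an
# embedding; it is an assumption for this example, not a chatterbox function.
voice_cache <- new.env(parent = emptyenv())

cached_embedding <- function(path, encode) {
  path <- normalizePath(path, mustWork = TRUE)
  mtime <- file.info(path)$mtime
  # A changed mtime yields a new key, so an edited reference is re-encoded.
  key <- paste(path, format(as.numeric(mtime), digits = 17), sep = "|")
  if (!exists(key, envir = voice_cache, inherits = FALSE)) {
    assign(key, encode(path), envir = voice_cache)
  }
  get(key, envir = voice_cache, inherits = FALSE)
}
```

Repeated requests for an unchanged reference then reuse one stored object instead of re-encoding it each time. This sketch never evicts stale keys; a long-running server would want to.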
trim_silence() now raises a clear error instead of the cryptic one if
NaN audio ever reaches it.

serve() now uses the jit backend for turbo as well as standard (was
eager "r" for turbo, written before the turbo jit decode step
existed). A turbo serve now runs the fast GPT-2 jit decode (~8x faster
per token).

[sigh], [laugh], [whispering], [cough], ...) as single special
tokens. load_gpt2_tokenizer() builds an added-token split-list and
tokenize_text_gpt2() splits on it before BPE; previously the tags were
byte-BPE'd into [, sigh, ] and never rendered.

t3_inference_turbo_jit(): a TorchScript decode step for turbo's
GPT-2 backbone, selected by generate(turbo, backend = "jit"). ~8x
faster per token than the eager turbo path (the turbo counterpart of
t3_inference_jit).

nn_linear reimplementation (non-square ones were failing to load ->
random weights), and gpt2_model$forward now adds the wpe absolute
position embeddings that HF GPT2Model applies. With jit, turbo is ~1.6x
faster than the standard model at comparable VRAM.

chatterbox() now constructs and loads the model by default (one
call, like Python from_pretrained). Pass load = FALSE for the bare
object. Mildly breaking: code that used chatterbox() as a cheap
constructor before a separate load_chatterbox() now needs
load = FALSE (or relies on load_chatterbox() being idempotent).

load_chatterbox() / load_chatterbox_turbo() are idempotent: an
already-loaded model is returned unchanged.

generate(output_path = ) also writes the audio to a WAV file and adds a
path element; tts_to_file() is now a thin wrapper over it.

generate() now defaults to normalize_text = FALSE. The internal-caps
mitigation patched a since-fixed (column-major/STFT) bug and was
flattening intended emphasis; punctuation normalization still always
runs. normalize_tts_text(caps =, punctuation =) is the single entry
point.

generate() now errors clearly when the input exceeds the T3
text-token limit instead of crashing, and sizes the traced CFM from the
actual generated token count (no text-length guessing).

tts_chunked() is the long-form layer: word-safe splitting, voice
resolved once, and T3 run first so batching and the per-card memory cap
use actual speech-token lengths.

serve() routes synthesis through tts_chunked() (long-text splitting

generate_batch(): several texts, one batched S3Gen synthesis
pass; padded rows validated to match single runs (mel diff <= 0.005).

s3gen$inference() accepts ragged batches via speech_token_lens.

voice_convert(): speech-to-speech voice conversion (port of
Python ChatterboxVC); re-renders source speech in a target voice,
preserving the source timing.

generate(skip_vocoder = TRUE) returns the mel spectrogram instead of
audio (Python 0.1.7 parity).

save_voice_embedding()/load_voice_embedding(): torch_save-based
voice presets, reusable across sessions without the reference audio.

integrated_loudness() and normalize_loudness() (ITU-R
BS.1770-4, pure base R, matches pyloudnorm to 6 decimals);
create_voice_embedding() gains norm_loudness, defaulting to TRUE
for turbo models (Python parity).

read_audio() downmixes stereo files by channel mean (librosa
parity); previously the right channel was silently dropped.

chatterbox_gc_options() now returns a classed list of the
recommended options() values (apply with do.call(options, ...)
before torch loads); the printed advice moved to its print method.

backend = "jit": each token's 30-layer forward runs as one
TorchScript function (torch::jit_compile, compiled per session in
milliseconds). 11 ms/token long-form with tuned GC settings, within
~20% of the C++ backend it replaces, auto-sized KV cache, no
compiled code.

src/, configure, and cleanup: the C++ backend linked
against the torch package's private libtorch, which broke on install
order, was dead in CRAN-built binaries, and could go stale on torch
upgrades. chatterbox is now a pure-R package.

backend = "jit",
which inherits the auto-sized cache.)

generate() gains max_new_tokens and max_cache_len.

tts_chunked() actually enforces chunk_size now (it was dead
code): run-on sentences split at comma boundaries.

torch.cuda_allocator_reserved_rate set above
the model's reserved fraction of the card (~10x pure-R speedup, ~15x
for the compiled-loop backend). New chatterbox_gc_options() prints the
snippet for your GPU; the performance vignette has the full attribution
table.

backend = "jit" at ~11 ms/token long-form). Repetition penalty
vectorized on-device.

tts_chunked() collects garbage once per chunk, bounding dead tensor
handles (and VRAM creep) at one utterance's worth.

Full top-to-bottom comparison against the Python reference; thanks to
@chris-english for the bug reports that prompted it (#1, #2, #5).
generate() now applies punc_norm() unconditionally like the Python
reference (whitespace collapse, first-letter capitalization,
punctuation rewrites, trailing period). The missing trailing period
was a major cause of missed end-of-speech (#1).

[laughter], [sigh], [whisper], ...) now
tokenize atomically instead of being spelled out letter by letter (#5).

top_p defaults to 1.0 (disabled) like Python; min_p and
repetition_penalty are now actually forwarded to the standard model.

eos_found = FALSE.

conds.pt no longer downloaded (unused by the R API).
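
For readers unfamiliar with the sampling arguments in the top_p / min_p entry above, here is a rough base-R sketch of the standard definitions: nucleus (top_p) filtering keeps the smallest set of most-probable tokens whose cumulative probability reaches top_p, so 1.0 keeps everything, and min_p drops tokens whose probability falls below min_p times the top probability. filter_probs() is a hypothetical helper; the order in which the package applies its filters, and how it does so on-device, is not shown here.

```r
# Hypothetical helper illustrating the textbook top_p and min_p filters on a
# probability vector; not the package's sampling code.
filter_probs <- function(probs, top_p = 1.0, min_p = 0.05) {
  keep <- rep(TRUE, length(probs))
  if (top_p < 1.0) {
    ord <- order(probs, decreasing = TRUE)
    # Probability mass strictly before each token in descending order.
    cum_before <- cumsum(probs[ord]) - probs[ord]
    # Drop tokens that start after the nucleus is already full.
    keep[ord[cum_before >= top_p]] <- FALSE
  }
  if (min_p > 0) {
    # Drop tokens far less likely than the most likely token.
    keep <- keep & probs >= min_p * max(probs)
  }
  out <- ifelse(keep, probs, 0)
  out / sum(out)  # renormalize the surviving tokens
}
```

With top_p = 1.0 and min_p = 0 the function returns its input unchanged, which is the sense in which top_p = 1.0 is "disabled".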