Changes in version 0.2.1 (2026-06-25)

- serve() resolves a voice-library name (e.g. "Barry") against voices_dir before treating the voice field as a path. Previously a like-named file or directory in the server's working directory shadowed the library voice, so /v1/audio/speech returned a 500 ("cannot open the connection"). A path is now accepted only when it is a regular file.

Changes in version 0.2.0

First CRAN release. Gathers the 0.1.0.1 - 0.1.0.16 development series: a complete pure-R port of Chatterbox TTS (no Python, no compiled code), voice cloning, long-form chunked synthesis, an OpenAI-compatible serve(), a TorchScript (jit) decode backend at container speed, and automatic CUDA GC tuning. Per-change detail for the series is below.

Changes in version 0.1.0.16

- chatterbox() gains a tune_gc argument (default TRUE) to opt out of the CUDA GC tuning added in 0.1.0.15. The tuning is a deliberate, persistent options() side effect (torch reads the allocator rates later, at CUDA init) and is documented in ?chatterbox; pass tune_gc = FALSE to skip it. No behavior change at the default.

Changes in version 0.1.0.15

- chatterbox() now tunes torch's CUDA garbage-collection rates before the first CUDA op. torch reads torch.cuda_allocator_reserved_rate (and torch.threshold_call_gc) once at lazy CUDA init; with the default floor of 0.2, GC ran on nearly every allocation once a model occupied more than 20% of VRAM, and that GC accounted for 53% of inference wall time. The floor is now the model's footprint as a fraction of VRAM (4.1 GB regular, 3.6 GB turbo): e.g. a 16 GB card gets 0.26 / 0.23, a 6 GB card 0.68 / 0.60. threshold_call_gc is raised to 16000 MB. All of these are set before cuda_is_available() is called. Turbo is ~2x faster on a 16 GB card (10.7 s -> 5.3 s for a 16 s utterance). An explicit user-set option still wins. See torch's memory-management vignette.
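
The floor arithmetic in the 0.1.0.15 entry can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: gc_floor() is a hypothetical helper, and the footprints are the figures quoted above. The two option names are the ones the entry mentions, and the guards mirror the "explicit user-set option still wins" rule.

```r
# Sketch only: gc_floor() is a hypothetical helper, not part of the package API.
# It expresses the model's footprint as a fraction of the card's total VRAM.
gc_floor <- function(model_gb, vram_gb) {
  stopifnot(model_gb > 0, vram_gb > 0)
  model_gb / vram_gb
}

# Footprints quoted in the 0.1.0.15 entry (GB).
footprint_gb <- c(regular = 4.1, turbo = 3.6)

floor_16gb <- vapply(footprint_gb, gc_floor, numeric(1), vram_gb = 16)
floor_6gb  <- vapply(footprint_gb, gc_floor, numeric(1), vram_gb = 6)

# A user-set option still wins: only fill in the value when it is unset.
# These must be set before torch initialises CUDA to have any effect.
if (is.null(getOption("torch.cuda_allocator_reserved_rate"))) {
  options(torch.cuda_allocator_reserved_rate = floor_16gb[["turbo"]])
}
if (is.null(getOption("torch.threshold_call_gc"))) {
  options(torch.threshold_call_gc = 16000)
}

print(round(rbind(`16GB` = floor_16gb, `6GB` = floor_6gb), 2))
```

In normal use none of this is needed, since chatterbox() applies the tuning itself unless tune_gc = FALSE is passed; the sketch only shows where the per-card numbers come from.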

Changes in version 0.1.0.14

- read_audio() now detects the audio container from the file's magic bytes (RIFF/WAVE, ID3, MP3 frame sync) instead of trusting the extension. A reference saved as PCM/WAV but named .mp3 (or vice versa) previously ran the wrong decoder and produced NaN garbage, silently corrupting voice cloning; it now decodes correctly.

Changes in version 0.1.0.13

- serve() now caches each voice embedding (by reference path + mtime) and reuses it across requests, instead of re-encoding the reference on every /v1/audio/speech call. Per-request re-encoding churned voice GPU tensors and raced the CUDA caching allocator, intermittently producing NaN speaker conditioning. This showed up as a "missing value where TRUE/FALSE needed" 500 and as degraded voice cloning (~33-50% of requests on both an RTX 5060 Ti and a GTX 1660 Ti; none with the cache). trim_silence() now raises a clear error instead of the cryptic one if NaN audio ever reaches it.

Changes in version 0.1.0.12

- serve() now uses the jit backend for turbo as well as standard (it was eager "r" for turbo, written before the turbo jit decode step existed). A turbo serve now runs the fast GPT-2 jit decode (~8x faster per token).

Changes in version 0.1.0.11

- Turbo's GPT-2 tokenizer now emits the paralinguistic/emotion tags ([sigh], [laugh], [whispering], [cough], ...) as single special tokens. load_gpt2_tokenizer() builds an added-token split-list and tokenize_text_gpt2() splits on it before BPE; previously the tags were byte-BPE'd into [, sigh, ] and never rendered.

Changes in version 0.1.0.10

- New t3_inference_turbo_jit(): a TorchScript decode step for turbo's GPT-2 backbone, selected by generate(turbo, backend = "jit"). ~8x faster per token than the eager turbo path (the turbo counterpart of t3_inference_jit).
- Fixed turbo correctness (it was producing nonsense): the HF GPT-2 Conv1D projection weights are now transposed for the nn_linear reimplementation (non-square ones were failing to load -> random weights), and gpt2_model$forward now adds the wpe absolute position embeddings that HF GPT2Model applies. With jit, turbo is ~1.6x faster than the standard model at comparable VRAM.

Changes in version 0.1.0.9

- chatterbox() now constructs and loads the model by default (one call, like Python from_pretrained). Pass load = FALSE for the bare object. Mildly breaking: code that used chatterbox() as a cheap constructor before a separate load_chatterbox() now needs load = FALSE (or relies on load_chatterbox() being idempotent).
- load_chatterbox() / load_chatterbox_turbo() are idempotent: an already-loaded model is returned unchanged.
- generate(output_path = ) also writes the audio to a WAV and adds a path element; tts_to_file() is now a thin wrapper over it.
- generate() defaults normalize_text = FALSE. The internal-caps mitigation patched a since-fixed (column-major/STFT) bug and was flattening intended emphasis; punctuation normalization still always runs. normalize_tts_text(caps =, punctuation =) is the single entry point.
- generate() now errors clearly when the input exceeds the T3 text-token limit instead of crashing, and sizes the traced CFM from the actual generated token count (no text-length guessing).
- tts_chunked() is the long-form layer: word-safe splitting, voice resolved once, and T3 run first so batching and the per-card memory cap use actual speech-token lengths.
- serve() routes synthesis through tts_chunked() (long-text splitting and per-card batching) and forwards more request knobs.

Changes in version 0.1.0.8

- New generate_batch(): several texts, one batched S3Gen synthesis pass; padded rows validated to match single runs (mel diff <= 0.005).
- s3gen$inference() accepts ragged batches via speech_token_lens.

Changes in version 0.1.0.7

- New voice_convert(): speech-to-speech voice conversion (port of Python ChatterboxVC); re-renders source speech in a target voice, preserving the source timing.

Changes in version 0.1.0.6

- generate(skip_vocoder = TRUE) returns the mel spectrogram instead of audio (Python 0.1.7 parity).
- New save_voice_embedding()/load_voice_embedding(): torch_save-based voice presets, reusable across sessions without the reference audio.

Changes in version 0.1.0.5

- New integrated_loudness() and normalize_loudness() (ITU-R BS.1770-4, pure base R, matches pyloudnorm to 6 decimals); create_voice_embedding() gains norm_loudness, defaulting to TRUE for turbo models (Python parity).
- read_audio() downmixes stereo files by channel mean (librosa parity); previously the right channel was silently dropped.
- Parity reference retargeted to chatterbox-tts 0.1.7.

Changes in version 0.1.0.4

- chatterbox_gc_options() now returns a classed list of the recommended options() values (apply with do.call(options, ...) before torch loads); the printed advice moved to its print method.

Changes in version 0.1.0.3

C++ apparatus retired in favor of a TorchScript backend (June 2026)

- New backend = "jit": each token's 30-layer forward runs as one TorchScript function (torch::jit_compile, compiled per session in milliseconds). 11 ms/token long-form with tuned GC settings, within ~20% of the C++ backend it replaces, auto-sized KV cache, no compiled code.
- Deleted src/, configure, and cleanup: the C++ backend linked against the torch package's private libtorch, which broke on install order, was dead in CRAN-built binaries, and could go stale on torch upgrades. chatterbox is now a pure-R package.
- Measured dispatch attribution (see the performance vignette): even eager R written directly against ATen builtins keeps a ~70 ms/token floor; the per-op R call is the cost, not wrapper style.

Container parity for long-form (June 2026)

- The CFM estimator's attention uses the fused SDPA kernel: the mel stage runs 2.5x faster and stops triggering GC storms at long sequence lengths.
- The fast backend auto-sizes its KV cache, so generations of any length complete; with tuned GC settings, long-form native generation runs at container speed (0.30 vs 0.29 wall-seconds per audio-second). (Measured on the C++ backend, since replaced by backend = "jit", which inherits the auto-sized cache.)
- generate() gains max_new_tokens and max_cache_len.
- tts_chunked() actually enforces chunk_size now (it was dead code): run-on sentences split at comma boundaries.

GC tuning and performance (June 2026)

- With torch's default allocator settings, inference is garbage-collection-bound: ~91% of pure-R generation wall time is R GC. One option fixes it: torch.cuda_allocator_reserved_rate set above the model's reserved fraction of the card (~10x pure-R speedup, ~15x for the compiled-loop backend). New chatterbox_gc_options() prints the snippet for your GPU; the performance vignette has the full attribution table.
- The compiled-loop backend measured fastest native under tuned GC (19-28 ms/token short-form; that C++ backend has since been replaced by backend = "jit" at ~11 ms/token long-form). Repetition penalty vectorized on-device.
- tts_chunked() collects garbage once per chunk, bounding dead tensor handles (and VRAM creep) at one utterance's worth.
- Performance vignette rewritten around these findings, with a hardware-scope caveat: numbers are from one GPU; the mechanism generalizes, the magnitudes may not.

Changes in version 0.1.0.1

Fidelity review vs chatterbox-tts 0.1.4 (June 2026)

Full top-to-bottom comparison against the Python reference; thanks to @chris-english for the bug reports that prompted it (#1, #2, #5).

Text front end

- generate() now applies punc_norm() unconditionally like the Python reference (whitespace collapse, first-letter capitalization, punctuation rewrites, trailing period). The missing trailing period was a major cause of missed end-of-speech (#1).
- Paralinguistic tokens ([laughter], [sigh], [whisper], ...) now tokenize atomically instead of being spelled out letter by letter (#5).
- Fixed BPE corruption for inputs that fully merge to one token.

Sampling

- Repetition penalty is sign-dependent (HF semantics) in all backends; the old divide-only form rewarded repeats with negative logits (#1).
- top_p defaults to 1.0 (disabled) like Python; min_p and repetition_penalty are now actually forwarded to the standard model.
- Degenerate-loop guard: the same token sampled 10x in a row stops generation with a warning and eos_found = FALSE.

Conditioning

- Windowed-sinc resampler and Kaldi fbank ports (validated against torchaudio to < 1e-8); the speaker encoder now sees the features it was trained on.
- Reference audio capped at 10 s (S3Gen) / 6 s (tokenizer prompt), as upstream; voice encoder trims silence and uses Resemble's windowing.
- Prompt mel/token alignment fixed for references that are not a multiple of 40 ms.

Other

- CFG unconditional branch, double-BOS prefill, exact GELU, fp32 default (autocast now opt-in), CUDA/MPS availability fallback, batch-safe pad masks, Python-parity SOS/EOS token stripping.
- conds.pt no longer downloaded (unused by the R API).
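
The sign-dependent repetition penalty from the Sampling entry above can be illustrated with a minimal base-R sketch. Both function names are hypothetical, and plain numeric vectors stand in for the on-device tensors, so this is not the package's internal code. It contrasts the HF-style rule (divide positive logits, multiply negative ones) with the old divide-only form, which moves a negative logit toward zero and so makes a repeat more likely.

```r
# Hypothetical helpers for illustration; not the package's internal code.
# logits: numeric vector of scores; seen_ids: 1-based indices of tokens
# already generated (1-based indexing is an R convention assumed here).

# Sign-dependent form: divide positive logits by the penalty and multiply
# negative ones, so a seen token always becomes less likely.
apply_repetition_penalty <- function(logits, seen_ids, penalty = 1.2) {
  stopifnot(penalty > 0)
  seen_ids <- unique(seen_ids)
  x <- logits[seen_ids]
  logits[seen_ids] <- ifelse(x > 0, x / penalty, x * penalty)
  logits
}

# Old divide-only form, shown for contrast: a negative logit moves toward
# zero, which raises the probability of the repeated token.
divide_only <- function(logits, seen_ids, penalty = 1.2) {
  seen_ids <- unique(seen_ids)
  logits[seen_ids] <- logits[seen_ids] / penalty
  logits
}

logits <- c(2.0, -2.0, 0.5, -0.5)
seen   <- c(1L, 2L)

fixed <- apply_repetition_penalty(logits, seen, penalty = 2)
old   <- divide_only(logits, seen, penalty = 2)

print(rbind(original = logits, sign_dependent = fixed, divide_only = old))
```

For the seen token with logit -2.0, the sign-dependent rule lowers it to -4.0, while the divide-only form raises it to -1.0.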